60 minute blitz uses stacked Dense layers with no activation function · FluxML/model-zoo#339

Repository metrics

Stars: 934
PR merge metrics: average merge time 384d 21h; 2 PRs merged in the last 30 days

Description

In the 60 minute blitz tutorial, we use a sequence of stacked Dense layers, none of which has an activation function. This doesn't make much sense, because a composition of linear operators can always be collapsed into a single linear operator:

julia> using Flux
       model = Chain(
           Dense(200, 120, bias=false),
           Dense(120, 84, bias=false),
           Dense(84, 10, bias=false),
       )

       # The product of the three weight matrices is a single 10×200 matrix
       model_condensed = Chain(
           Dense(model[3].weight * model[2].weight * model[1].weight, false),
       )

       x = randn(200)
       sum(abs, model(x) .- model_condensed(x))
2.4189600187907168e-6

Yes, floating-point rounding means the two models are not exactly equivalent, but you don't get any material benefit from stacking multiple Dense layers this way. In fact you pay a performance penalty, because the same values keep moving in and out of the CPU cache.
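The snippet above disables biases with bias=false, but the same collapse holds when each layer keeps its bias. A short derivation, where W_i and b_i are assumed notation for the weight matrix and bias vector of layer i (they are not names from the tutorial):

```latex
\begin{aligned}
h_1 &= W_1 x + b_1, \qquad W_1 \in \mathbb{R}^{120 \times 200} \\
h_2 &= W_2 h_1 + b_2 = W_2 W_1 x + W_2 b_1 + b_2, \qquad W_2 \in \mathbb{R}^{84 \times 120} \\
y   &= W_3 h_2 + b_3 = \underbrace{W_3 W_2 W_1}_{W \,\in\, \mathbb{R}^{10 \times 200}} x
       + \underbrace{W_3 W_2 b_1 + W_3 b_2 + b_3}_{b \,\in\, \mathbb{R}^{10}}
\end{aligned}
```

So the three layers together compute y = W x + b, which is exactly what a single Dense layer from 200 inputs to 10 outputs can represent.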

It would be better to either add nonlinearities between these Dense layers to increase model flexibility, or replace them with a single Dense layer that maps the 200 inputs directly to the 10 outputs.
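As a rough sketch of those two options, assuming the same positional Dense(in, out, σ) constructor used in the snippet above (newer Flux releases prefer the form Dense(200 => 120, relu)). The choice of relu and the names model_nonlinear and model_single are illustrative only, not taken from the tutorial:

```julia
using Flux

# Option 1: keep three layers, but apply a nonlinearity after each hidden layer.
# relu is an arbitrary choice here; the last layer is left linear.
model_nonlinear = Chain(
    Dense(200, 120, relu),
    Dense(120, 84, relu),
    Dense(84, 10),
)

# Option 2: one linear layer from 200 inputs straight to 10 outputs.
model_single = Dense(200, 10)

# Both accept a 200-element input and return a 10-element output.
x = randn(Float32, 200)
size(model_nonlinear(x))  # (10,)
size(model_single(x))     # (10,)
```

Option 1 can represent functions that a single linear layer cannot, while option 2 has the same expressive power as the current stacked version with fewer parameters and fewer matrix multiplications.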

Contributor guide

Research direction: Replace the stacked Dense layers with no activation in the 60 minute blitz tutorial by either adding activation functions (e.g., relu) or merging them into a single Dense layer for better performance and clarity.
Tech stack: julia
Domain: documentation
Issue type: Documentation
Difficulty: 2
Estimated time: Under 1 hour
Activity status: Active
Clarity: Clear
Prerequisites: Julia, Flux, basic linear algebra
Newbie friendliness: 80
