60 minute blitz uses stacked Dense layers with no activation function · FluxML/model-zoo#339

Repository metrics

Stars: 934
PR merge metrics: average time to merge 384 days 21 hours; 2 PRs merged in the last 30 days

Description

In the 60 minute blitz tutorial, we use a sequence of stacked Dense layers, each with no activation function. This doesn't make much sense, as multiple linear operators can always be combined down into a single linear operator:

julia> using Flux
       model = Chain(
           Dense(200, 120, bias=false),
           Dense(120, 84, bias=false),
           Dense(84, 10, bias=false),
       )

       model_condensed = Chain(
           Dense(model[3].W * model[2].W * model[1].W),
       )

       x = randn(200)
       sum(abs, model(x) .- model_condensed(x))
2.4189600187907168e-6

The result is not exactly zero because of floating-point rounding, but the two models compute the same function. Stacking multiple Dense layers without activations gives no material benefit, and it carries a performance penalty because the same values move in and out of CPU cache.
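
The collapse follows from associativity of matrix multiplication. As a short derivation, writing $W_1, W_2, W_3$ for the three weight matrices and $b_1, b_2, b_3$ for optional biases (notation introduced here for illustration, not taken from the issue):

```latex
\begin{aligned}
W_3\bigl(W_2(W_1 x)\bigr) &= (W_3 W_2 W_1)\,x, \\
W_3\bigl(W_2(W_1 x + b_1) + b_2\bigr) + b_3
  &= (W_3 W_2 W_1)\,x + \bigl(W_3 W_2 b_1 + W_3 b_2 + b_3\bigr).
\end{aligned}
```

So the stack reduces to a single affine map even when biases are enabled. Once an elementwise nonlinearity $\sigma$ sits between the layers, as in $W_2\,\sigma(W_1 x)$, the product can no longer be factored out this way in general.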

It would be better to either add nonlinearities between these Dense layers to increase model flexibility, or replace them with a single Dense layer that maps directly from 200 inputs to 10 outputs.
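
A minimal sketch of the two options, assuming a recent Flux release that accepts the `in => out` pair syntax (on older versions matching the snippet above, `Dense(200, 120, relu)` is the equivalent form). The variable names `model_nonlinear` and `model_single` are made up for illustration, and `relu` is only one possible choice of activation:

```julia
using Flux

# Option 1: keep the three layers, but put a nonlinearity after each hidden
# layer so the stack can no longer be collapsed into one matrix.
model_nonlinear = Chain(
    Dense(200 => 120, relu),
    Dense(120 => 84, relu),
    Dense(84 => 10),          # output layer left linear
)

# Option 2: a single linear layer that maps 200 inputs straight to 10 outputs.
model_single = Chain(
    Dense(200 => 10),
)

x = randn(Float32, 200)

# Both models return a vector of 10 outputs.
size(model_nonlinear(x))  # (10,)
size(model_single(x))     # (10,)
```

Option 1 gives the model more flexibility at the same layer sizes. Option 2 keeps the function class of the original stack while avoiding the intermediate results.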

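Contributor guide

Task: In the 60 minute blitz tutorial, replace the stacked Dense layers that have no activation function, either by adding activation functions (for example relu) or by merging them into a single Dense layer, to improve performance and clarity.
Tech stack: julia
Domain: documentation
Issue type: Documentation
Difficulty: 2
Estimated time: under 1 hour
Activity status: Active
Clarity: Clear
Prerequisites: Julia, Flux, basic linear algebra
Beginner friendliness: 80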
