AccelerateHS/accelerate

Performance of highly skewed multidimensional reductions

Open

#140 opened on Dec 19, 2013

View on GitHub
 (7 comments) (0 reactions) (0 assignees)Haskell (830 stars) (112 forks)batch import
good first issuehelp wantedllvm-ptx

Description

Performance of multidimensional reductions is not good when the array is highly skewed. For example, a fold where the number of columns is (innermost dimension) is very small. See also this thread:

https://groups.google.com/forum/#!topic/accelerate-haskell/KAFYUz4Sjsk

Multidimensional reduction uses one thread block per reduction; so an (Z :. m :. n) sized matrix uses m thread blocks. If n is very small, then many threads in the block sit idle. We could change this to a warp-per-reduction style, which is actually the strategy segmented fold uses. This will likely have a negative impact if m is small and n large.

It would be possible to generate both variants and choose dynamically which to execute. That implies compiling four kernels per reduction (because fusion; initial vs. recursive step).

Contributor guide