Performance of highly skewed multidimensional reductions
#140 opened on Dec 19, 2013
Description
Performance of multidimensional reductions is not good when the array is highly skewed. For example, a fold where the number of columns is (innermost dimension) is very small. See also this thread:
https://groups.google.com/forum/#!topic/accelerate-haskell/KAFYUz4Sjsk
Multidimensional reduction uses one thread block per reduction; so an (Z :. m :. n) sized matrix uses m thread blocks. If n is very small, then many threads in the block sit idle. We could change this to a warp-per-reduction style, which is actually the strategy segmented fold uses. This will likely have a negative impact if m is small and n large.
It would be possible to generate both variants and choose dynamically which to execute. That implies compiling four kernels per reduction (because fusion; initial vs. recursive step).