[Performance][Sampling] Parallelize CSRSliceRows() on the CPU with multiple threads. · dmlc/dgl#3400

Repository metrics

Stars: 12,665
PR merge metrics: (No merged PRs in 30d)

Description

🚀 Feature

The function CSRSliceRows() on the CPU is currently not parallelized (https://github.com/dmlc/dgl/blob/master/src/array/cpu/spmat_op_impl_csr.cc#L361), which makes MultiLayerFullNeighborSampler quite slow.

Motivation

When memory is sufficient, it is currently faster to use a MultiLayerNeighborSampler with fanouts equal to the maximum degree in the graph than to use MultiLayerFullNeighborSampler, even though the full neighbor sampler does not need to perform any selection or random number generation. As full neighbor sampling is quite slow to begin with, this is problematic.

When sampling on the CPU and performing to_block() on the GPU, no sampling workers can be used, and this lack of parallelism hurts performance quite a bit.

Pitch

It could be parallelized similarly to uniform sampling (https://github.com/dmlc/dgl/blob/master/src/array/cpu/rowwise_pick.h#L72), with the caveat that we would need to wait until global_prefix is calculated (https://github.com/dmlc/dgl/blob/master/src/array/cpu/rowwise_pick.h#L147) before allocating the output arrays, because the total number of edges in the subgraph is not known until then.
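As a rough illustration of the idea, here is a minimal two-pass sketch. It is not DGL's implementation: the `Csr` struct and `SliceRowsParallel` are hypothetical names used only for this example, it uses plain `std::vector` instead of DGL's array types, and it assumes an explicit edge-id array. The prefix sum is also done sequentially for simplicity, whereas rowwise_pick.h combines per-thread results into global_prefix. The point it shows is the ordering constraint: compute row lengths in parallel, finish the prefix sum, and only then allocate and fill the outputs in parallel.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical minimal CSR container for illustration (not DGL's CSRMatrix).
struct Csr {
  std::vector<int64_t> indptr;   // row offsets, size num_rows + 1
  std::vector<int64_t> indices;  // column ids, size nnz
  std::vector<int64_t> data;     // edge ids, size nnz (assumed explicit here)
};

// Slice the requested rows out of `csr`, keeping every column of each row.
Csr SliceRowsParallel(const Csr& csr, const std::vector<int64_t>& rows) {
  const int64_t n = static_cast<int64_t>(rows.size());
  Csr out;
  out.indptr.assign(n + 1, 0);

  // Pass 1: each output row's length is independent, so compute in parallel.
#pragma omp parallel for
  for (int64_t i = 0; i < n; ++i) {
    const int64_t r = rows[i];
    out.indptr[i + 1] = csr.indptr[r + 1] - csr.indptr[r];
  }

  // Prefix sum over the row lengths (sequential in this sketch).
  for (int64_t i = 0; i < n; ++i) out.indptr[i + 1] += out.indptr[i];

  // The total edge count is only known now, so allocate the outputs here.
  const int64_t nnz = out.indptr[n];
  out.indices.resize(nnz);
  out.data.resize(nnz);

  // Pass 2: every row writes to a disjoint output range, so copy in parallel.
#pragma omp parallel for
  for (int64_t i = 0; i < n; ++i) {
    const int64_t src = csr.indptr[rows[i]];
    const int64_t dst = out.indptr[i];
    const int64_t len = out.indptr[i + 1] - dst;
    std::copy_n(csr.indices.begin() + src, len, out.indices.begin() + dst);
    std::copy_n(csr.data.begin() + src, len, out.data.begin() + dst);
  }
  return out;
}
```

Built without OpenMP the pragmas are ignored and the function runs sequentially with the same result, which makes it easy to compare against a single-threaded baseline.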

Contributor guide

Research direction: Inspect the existing CSRSliceRows() implementation and the parallel uniform sampling in rowwise_pick.h. Parallelize CSRSliceRows() with OpenMP, ensuring that the global prefix sum is computed before the output arrays are allocated.
Tech stack: cpp, python
Domain: performance, backend
Issue type: Performance
Difficulty: 3
Estimated time: 3-5 days
Activity status: Fresh
Clarity: Mostly clear
Prerequisites: C++, OpenMP, Graph algorithms
Newbie friendliness: 30