[FEA] Add pipelining to the `NDS-H-cpp` benchmarks · rapidsai/cudf#18206

Repository metrics

Stars: (6,000 stars)
PR merge metrics: (average time to merge 17d 21h) (230 merged PRs in 30d)

Description

Is your feature request related to a problem? Please describe.

In the libcudf microbenchmarks, the NDS-H-cpp benchmarks are a useful tool for studying GPU query performance.

They could also be used to study pipelining. An application can "pipeline" work on the GPU using 2 or more host threads to sequence calls to the libcudf public API. Pipelining is useful in IO-heavy workloads where one thread can be copying data to the GPU while another thread is running kernels over previously-copied data. Pipelining is needed to ensure that GPU compute is not left idle during copying steps.
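
As a rough illustration of the two-thread pattern described above, here is a minimal host-only sketch. It makes no libcudf or CUDA calls: `Chunk`, `ChunkQueue`, and `run_pipeline` are hypothetical names, the "copy" stage just fills a vector, and the "compute" stage just sums it. The point is only the overlap structure, where one thread produces chunks while another consumes previously produced ones.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <numeric>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a chunk of data that has been copied to the GPU.
using Chunk = std::vector<int>;

// Minimal thread-safe queue connecting the "copy" stage to the "compute" stage.
class ChunkQueue {
 public:
  void push(Chunk chunk)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      chunks_.push(std::move(chunk));
    }
    ready_.notify_one();
  }

  // Signals that no more chunks will arrive.
  void close()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      closed_ = true;
    }
    ready_.notify_all();
  }

  // Blocks until a chunk is available; returns std::nullopt once closed and drained.
  std::optional<Chunk> pop()
  {
    std::unique_lock<std::mutex> lock(mutex_);
    ready_.wait(lock, [this] { return !chunks_.empty() || closed_; });
    if (chunks_.empty()) { return std::nullopt; }
    Chunk chunk = std::move(chunks_.front());
    chunks_.pop();
    return chunk;
  }

 private:
  std::mutex mutex_;
  std::condition_variable ready_;
  std::queue<Chunk> chunks_;
  bool closed_ = false;
};

// Runs a two-stage pipeline and returns the sum of all elements processed.
// Chunk i holds chunk_size copies of the value i.
long long run_pipeline(int num_chunks, int chunk_size)
{
  assert(num_chunks >= 0 && chunk_size >= 0);
  ChunkQueue queue;
  long long total = 0;

  // Stage 1: stands in for the thread that copies data to the GPU.
  std::thread copier([&] {
    for (int i = 0; i < num_chunks; ++i) {
      queue.push(Chunk(static_cast<std::size_t>(chunk_size), i));
    }
    queue.close();
  });

  // Stage 2: stands in for the thread that runs kernels on already-copied data.
  std::thread computer([&] {
    while (auto chunk = queue.pop()) {
      total += std::accumulate(chunk->begin(), chunk->end(), 0LL);
    }
  });

  copier.join();
  computer.join();
  return total;
}
```

Calling `run_pipeline(4, 3)` pushes four chunks through both stages. In a real benchmark the two stages would issue copies and kernels on separate CUDA streams so that they can overlap on the device.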

Describe the solution you'd like

Claude and I wrote a simple concurrent benchmark for query 5 using PTDS (per-thread default stream). We could take this idea and update it to use a CUDA stream pool instead. We would also want to consider how pipelining could be applied to other queries without modifying each query file.

void ndsh_q5_concurrent(nvbench::state& state)
{
  // Generate the required parquet files in device buffers
  double const scale_factor = state.get_float64("scale_factor");
  int const num_threads = state.get_int64("num_threads");
  int const runs_per_thread = state.get_int64("runs_per_thread");
  
  std::unordered_map<std::string, cuio_source_sink_pair> sources;
  generate_parquet_data_sources(
    scale_factor, {"customer", "orders", "lineitem", "supplier", "nation", "region"}, sources);
  
  BS::thread_pool threads(num_threads);
  
  state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
    nvtxRangePushA(("ndsh_q5_concurrent " + std::to_string(num_threads) + " threads, " + 
                    std::to_string(runs_per_thread) + " runs/thread, scale_factor=" + 
                    std::to_string(scale_factor)).c_str());
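    // Each task runs the full query once; the task index is unused.
    auto query_func = [&](int /*index*/) {
      nvtxRangePushA("ndsh_q5");
      run_ndsh_q5(state, sources);
      nvtxRangePop();
    };

    // Pause the pool so every task is queued before any of them starts,
    // then release them together and wait for all runs to finish.
    threads.pause();
    threads.detach_sequence(0, num_threads * runs_per_thread, query_func);
    threads.unpause();
    threads.wait();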
    nvtxRangePop();
  });
}

NVBENCH_BENCH(ndsh_q5_concurrent)
  .set_name("ndsh_q5_concurrent")
  .add_float64_axis("scale_factor", {0.01, 0.1, 1})
  .add_int64_axis("num_threads", {2, 4})
  .add_int64_axis("runs_per_thread", {1, 4});

The profiles show that query 5 is IO-bound, yet it still has some bubbles where compute is running but IO is not. We should investigate why IO is blocking kernel work in some cases.
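
The stream-pool idea mentioned in the proposal could look roughly like the sketch below. This is an assumption-laden illustration, not the libcudf or RMM API: `Stream` and `StreamPool` are hypothetical names, and `Stream` is a plain struct standing in for a CUDA stream handle. A real implementation would hold actual CUDA streams (RMM's `rmm::cuda_stream_pool` may already fill this role) and pass the chosen stream to each libcudf call.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a CUDA stream handle.
struct Stream {
  std::size_t id;
};

// Fixed-size pool that hands out streams in round-robin order.
class StreamPool {
 public:
  explicit StreamPool(std::size_t size) : next_{0}
  {
    assert(size > 0);
    streams_.reserve(size);
    for (std::size_t i = 0; i < size; ++i) {
      streams_.push_back(Stream{i});
    }
  }

  // Safe to call from multiple host threads: the counter is atomic and
  // the vector of streams is never modified after construction.
  Stream const& get_stream() noexcept
  {
    auto const ticket = next_.fetch_add(1, std::memory_order_relaxed);
    return streams_[ticket % streams_.size()];
  }

  std::size_t size() const noexcept { return streams_.size(); }

 private:
  std::vector<Stream> streams_;
  std::atomic<std::size_t> next_;
};
```

With a pool like this, each query run could request a stream via `get_stream()` instead of relying on the per-thread default stream, which would decouple the number of streams from the number of host threads.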

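Contributor guide

Investigation approach: Study the existing NDS-H-cpp benchmarks, understand the concurrent benchmark example, and explore how to implement a CUDA stream pool for pipelining across different queries.
Tech stack: cpp
Area: performance
Issue type: Feature
Difficulty: 3
Estimated time: half a day
Activity: New
Clarity: Clear
Prerequisites: C++, CUDA, GPU programming, libcudf basics
Beginner-friendliness: 25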
