Description
Is your feature request related to a problem? Please describe.
In the libcudf microbenchmarks, the NDS-H-cpp benchmarks are a useful tool for studying GPU query performance.
They could also be used to study pipelining. An application can "pipeline" work on the GPU using 2 or more host threads to sequence calls to the libcudf public API. Pipelining is useful in IO-heavy workloads where one thread can be copying data to the GPU while another thread is running kernels over previously-copied data. Pipelining is needed to ensure that GPU compute is not left idle during copying steps.
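The overlap described above can be sketched on the CPU alone. This is a minimal illustration, not libcudf code: `chunk_queue` and `pipelined_sum` are hypothetical names, the `push` call stands in for a host-to-device copy, and `std::accumulate` stands in for kernel work. One host thread keeps "copying" chunks while a second thread processes whichever chunks have already arrived.

```cpp
#include <condition_variable>
#include <mutex>
#include <numeric>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal thread-safe queue that hands chunks from the copy stage to the compute stage.
template <typename T>
class chunk_queue {
 public:
  void push(T value)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      items_.push(std::move(value));
    }
    cv_.notify_one();
  }

  // Signals that no more chunks will arrive.
  void close()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      closed_ = true;
    }
    cv_.notify_all();
  }

  // Blocks until a chunk is available; returns std::nullopt once closed and drained.
  std::optional<T> pop()
  {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !items_.empty() || closed_; });
    if (items_.empty()) { return std::nullopt; }
    T value = std::move(items_.front());
    items_.pop();
    return value;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<T> items_;
  bool closed_{false};
};

// Hypothetical two-stage pipeline: one thread "copies" chunks, another "computes" on them.
long long pipelined_sum(std::vector<std::vector<int>> const& host_chunks)
{
  chunk_queue<std::vector<int>> ready;
  long long total = 0;

  std::thread copier([&] {
    for (auto const& chunk : host_chunks) {
      ready.push(chunk);  // stand-in for a host-to-device copy
    }
    ready.close();
  });

  std::thread worker([&] {
    while (auto chunk = ready.pop()) {
      // stand-in for kernel work on previously copied data
      total += std::accumulate(chunk->begin(), chunk->end(), 0LL);
    }
  });

  copier.join();
  worker.join();
  return total;  // safe to read: the worker has been joined
}
```

In a real GPU pipeline the two stages would issue work on different CUDA streams, so the copy engine and the compute units can be busy at the same time; the queue here only models the hand-off between the host threads.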
Describe the solution you'd like
Claude and I wrote a simple concurrent benchmark for query 5 using PTDS (per-thread default stream). We could take this idea and update it to use a CUDA stream pool. We would also want to consider how pipelining could be applied to other queries without modifying each query file.
void ndsh_q5_concurrent(nvbench::state& state)
{
  // Generate the required parquet files in device buffers
  double const scale_factor = state.get_float64("scale_factor");
  int const num_threads     = state.get_int64("num_threads");
  int const runs_per_thread = state.get_int64("runs_per_thread");

  std::unordered_map<std::string, cuio_source_sink_pair> sources;
  generate_parquet_data_sources(
    scale_factor, {"customer", "orders", "lineitem", "supplier", "nation", "region"}, sources);

  BS::thread_pool threads(num_threads);

  state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
    // Outer NVTX range covering all concurrent runs of this measurement
    nvtxRangePushA(("ndsh_q5_concurrent " + std::to_string(num_threads) + " threads, " +
                    std::to_string(runs_per_thread) + " runs/thread, scale_factor=" +
                    std::to_string(scale_factor)).c_str());

    // One task per query run; the task index is not used
    auto query_func = [&](int /*index*/) {
      nvtxRangePushA("ndsh_q5");
      run_ndsh_q5(state, sources);
      nvtxRangePop();
    };

    // Queue every run while the pool is paused, then release all tasks and wait
    threads.pause();
    threads.detach_sequence(0, num_threads * runs_per_thread, query_func);
    threads.unpause();
    threads.wait();

    nvtxRangePop();
  });
}

NVBENCH_BENCH(ndsh_q5_concurrent)
  .set_name("ndsh_q5_concurrent")
  .add_float64_axis("scale_factor", {0.01, 0.1, 1})
  .add_int64_axis("num_threads", {2, 4})
  .add_int64_axis("runs_per_thread", {1, 4});
The profiles show that query 5 is IO-bound, yet it still has some bubbles where compute is running but IO is not. We should investigate why IO is blocking kernel work in some cases.