rapidsai/cudf

[FEA] Explicitly guarantee row group ordering in the parquet reader.

Open

#15697 opened on May 7, 2024

Labels: cuIO, feature request, good first issue, improvement, libcudf

Description

@devavret raised the question of whether the parquet reader guarantees the relative ordering of row groups across multiple input files. That is, if you have two files [f1, f2] and the row groups within the files (in one column) are specified as [[r0,r3], [r0,r1]], do we guarantee that the output ordering is [f1r0, f1r3, f2r0, f2r1]?

The code does in fact produce this ordering, both when row groups are explicitly specified and when they are unspecified (empty user input, meaning all row groups), but we don't make any guarantees about it. Making the guarantee explicit seems like a safe and easy thing to add.

https://github.com/rapidsai/cudf/blob/5d244dfc13f4db0b1e41ded3029942fec50c98f6/cpp/src/io/parquet/reader_impl_helpers.cpp#L663
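
To make the proposed guarantee concrete, here is a minimal pure-Python model of the ordering described above. It is only a sketch, not libcudf code: the function `expected_row_group_order` and the `row_groups_per_file` argument are hypothetical names introduced for illustration, and the sketch assumes that an empty or missing request means "all row groups in every file".

```python
from typing import Dict, List, Optional, Sequence, Tuple


def expected_row_group_order(
    files: Sequence[str],
    requested: Optional[Sequence[Sequence[int]]],
    row_groups_per_file: Dict[str, int],
) -> List[Tuple[str, int]]:
    """Hypothetical model of the ordering guarantee (not the libcudf API).

    Files are visited in input order; within each file, row groups are
    emitted in the order the caller listed them.
    """
    if not requested:
        # Assumption: empty user input selects every row group of every file,
        # in ascending index order within each file.
        requested = [list(range(row_groups_per_file[name])) for name in files]

    order: List[Tuple[str, int]] = []
    for name, row_groups in zip(files, requested):
        for rg in row_groups:
            order.append((name, rg))
    return order


# The example from the issue: files [f1, f2], row groups [[r0, r3], [r0, r1]].
explicit = expected_row_group_order(
    ["f1", "f2"], [[0, 3], [0, 1]], {"f1": 4, "f2": 2}
)
print(explicit)  # [('f1', 0), ('f1', 3), ('f2', 0), ('f2', 1)]

# Unspecified case: all row groups, still file-major order.
everything = expected_row_group_order(["f1", "f2"], None, {"f1": 2, "f2": 2})
print(everything)  # [('f1', 0), ('f1', 1), ('f2', 0), ('f2', 1)]
```

A documented guarantee could be backed by a reader test that compares the output rows against this kind of file-major, request-order expectation.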
