[FEA] Pass column indices as `index_col` in `read_csv` · rapidsai/cudf#15127

7 comments · 0 reactions · 0 assignees · C++ · 735 forks

Labels: 0 - Backlog, Python, feature request, good first issue

Repository metrics

Stars: 6,000
PR merge metrics: average merge time 17d 21h; 230 PRs merged in the last 30 days

Description

When I use the `index_col` parameter to set certain columns as the index while reading a CSV file, I cannot pass a list of integer column positions the way pandas allows; doing so raises a `KeyError`. Passing a list of column labels works, though:

cudf.read_csv(filepath, index_col=[0])
KeyError: 'None of [0] are in the columns'

cudf.read_csv(filepath, index_col=['family'])

While this is not a huge issue, I imagine the following is a common scenario: you know that the first 3 columns are index columns, but you don't know exactly how each is spelled ('date' vs 'Date', etc.). If passing a list of column positions were possible, `index_col=[0, 1, 2]` would work fine. As it stands, you have to read the file without specifying index columns and set the index afterwards, or guess the column labels by trial and error.
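The "read first, set the index later" workaround can be sketched as below. This is only an illustration: `read_csv_with_positional_index` is a hypothetical helper, not part of cudf or pandas, and the demo runs with pandas so it needs no GPU. Passing `cudf.read_csv` as the reader is assumed to behave the same way, since the helper relies only on `DataFrame.columns` and `DataFrame.set_index`.

```python
import io

import pandas as pd


def read_csv_with_positional_index(source, index_positions, read_csv=pd.read_csv):
    """Hypothetical helper: read a CSV, then promote columns to the index by position.

    `read_csv` is any pandas-compatible reader; swapping in `cudf.read_csv`
    is an assumption, not something verified here.
    """
    # Read without index_col so every column is still a regular column.
    df = read_csv(source)
    # Translate integer positions into whatever labels the file actually uses.
    labels = [df.columns[i] for i in index_positions]
    return df.set_index(labels)


# Made-up sample data; the spelling of 'Date' does not need to be known in advance.
csv_text = "Date,family,sales\n2024-01-01,A,10\n2024-01-02,B,20\n"
df = read_csv_with_positional_index(io.StringIO(csv_text), [0, 1])
print(list(df.index.names))
print(list(df.columns))
```

The cost of this approach is that the whole file is parsed before the index is set, which is the extra step the issue asks to avoid.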

Is it possible for `index_col` to accept a list of column positions, as in pandas?

Contributor guide

Research direction: Implement support for integer column indices in the `index_col` parameter of the `read_csv` function, mimicking pandas behavior.
Tech stack: python
Domain: data
Issue type: Feature
Difficulty: 2
Estimated time: Half day
Activity status: Active
Clarity: Clear
Prerequisites: Python
Newbie friendliness: 70
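One way to approach the research direction above is to normalize `index_col` to column labels before the existing label-based lookup runs. The sketch below is a plain-Python illustration under that assumption; `resolve_index_col` is a hypothetical name, and this is not how cudf currently handles the parameter. Rejecting negative positions and booleans is a simplification chosen for the sketch, not a statement about pandas semantics.

```python
def resolve_index_col(index_col, column_names):
    """Hypothetical helper: map integer positions in index_col to column labels.

    Strings are kept as labels; integers are treated as zero-based positions.
    """
    if index_col is None:
        return None
    # Accept a single value or a list/tuple of values.
    items = index_col if isinstance(index_col, (list, tuple)) else [index_col]
    resolved = []
    for item in items:
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(item, int) and not isinstance(item, bool):
            if not 0 <= item < len(column_names):
                raise IndexError(f"index_col position {item} is out of range")
            resolved.append(column_names[item])
        elif item in column_names:
            resolved.append(item)
        else:
            raise KeyError(f"{item!r} is not in the columns")
    return resolved


columns = ["Date", "family", "sales"]
print(resolve_index_col([0, "family"], columns))
print(resolve_index_col(2, columns))
```

With a step like this in front of the reader, `index_col=[0]` and `index_col=['Date']` would resolve to the same label list for the sample columns.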
