dask/dask

Benchmark ORC reader

Open

#3734 opened on July 6, 2018

8 comments, 0 reactions, 0 assignees. Python. 11,520 stars, 1,658 forks.
Labels: dataframe, good first issue, io, needs info

Description

With #3284, we can now read ORC files into dask dataframes. This implementation has never been benchmarked, so it would be good to measure it and see whether there are any easy gains we're missing. Ideally this would be done at two levels:

  • Pandas/Arrow: are we getting the bandwidth we'd expect from the ORC C++ reader?
  • Dask: are we getting parallelism, and is our overhead low?

We'd need to use some other system (Spark, Hive, etc.) to generate test files, as no Python writer exists.
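As a starting point, both levels could be timed with one small harness. The sketch below is only an illustration: `measure_read_bandwidth` and `speedup` are hypothetical helper names, not part of dask or Arrow, and the harness assumes the caller supplies a path to an ORC file generated elsewhere plus a reader callable. It uses only the standard library, so it says nothing about the readers themselves.

```python
import os
import time
from typing import Callable, Dict


def measure_read_bandwidth(
    read: Callable[[str], object], path: str, repeats: int = 3
) -> Dict[str, float]:
    """Time `read(path)` several times and report the fastest run.

    Hypothetical helper: `read` is any callable that fully loads the file,
    so the same harness can wrap an Arrow-level or a Dask-level reader.
    """
    size_bytes = os.path.getsize(path)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        read(path)
        timings.append(time.perf_counter() - start)
    best = min(timings)
    return {
        "size_bytes": float(size_bytes),
        "best_seconds": best,
        # On-disk (compressed) bytes per second, in MB/s.
        "mb_per_second": size_bytes / best / 1e6 if best > 0 else float("inf"),
    }


def speedup(serial_seconds: float, parallel_seconds: float) -> float:
    """Ratio of single-threaded to parallel wall time (higher is better)."""
    return serial_seconds / parallel_seconds
```

For the Pandas/Arrow level, `read` might be something like `lambda p: pyarrow.orc.ORCFile(p).read()`. For the Dask level, it might be `lambda p: dask.dataframe.read_orc(p).compute(scheduler="synchronous")` compared against the same call with `scheduler="threads"`, feeding the two `best_seconds` values to `speedup`. Both assume the respective libraries are installed. Repeated reads will mostly hit the OS page cache, so the fastest run reflects decode cost more than disk throughput.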
