Benchmark ORC reader · dask/dask#3734

8 comments · 0 reactions · 0 assignees · Python · 1,658 forks

Labels: dataframe, good first issue, io, needs info

Repository metrics

Stars: 11,520
PR merge metrics: average merge time 5d 1h; 30 PRs merged in the last 30 days

Description

With #3284, we can now read ORC files into dask dataframes. It would be good/interesting to benchmark this implementation and see if there are any easy gains we're missing (this was never done). This would ideally be done at two levels:

Pandas/Arrow: are we getting the bandwidth we'd expect from the ORC C++ reader?
Dask: are we getting parallelism, and is our overhead low?

We'd need to use some other system (Spark, Hive, etc.) to generate test files, as no Python writer exists.
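The two measurement levels described above could be sketched with a few small helpers. This is only a rough outline, not an agreed benchmark design: the names `best_time`, `bandwidth_mb_s`, `parallel_efficiency`, and `read_bytes` are hypothetical, the on-disk (compressed) file size is assumed to be an acceptable denominator for bandwidth, and a plain byte read on a temporary file stands in for the ORC reader so the sketch runs without any test files.

```python
import os
import tempfile
import time


def best_time(read_fn, repeats=3):
    """Return the fastest wall-clock time (seconds) over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def bandwidth_mb_s(path, seconds):
    """On-disk megabytes per second for one file read in `seconds`."""
    return os.path.getsize(path) / 1e6 / seconds


def parallel_efficiency(serial_seconds, parallel_seconds, n_workers):
    """Speedup divided by worker count; 1.0 would mean perfect scaling."""
    return (serial_seconds / parallel_seconds) / n_workers


def read_bytes(path):
    # Stand-in reader (assumption): swap in the real ORC reader when benchmarking.
    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    # Write a throwaway 5 MB file so the sketch is self-contained.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(5_000_000))
    try:
        seconds = best_time(lambda: read_bytes(tmp.name))
        print(f"{bandwidth_mb_s(tmp.name, seconds):.0f} MB/s")
    finally:
        os.remove(tmp.name)
```

For the Pandas/Arrow level, `read_fn` might be something like `lambda: pyarrow.orc.ORCFile(path).read()`; for the Dask level, `lambda: dask.dataframe.read_orc(path).compute()`. Comparing the two timings on the same files, via `parallel_efficiency`, would give a first indication of whether Dask is adding parallelism or mostly overhead.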

Contributor guide

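Approach: Set up a benchmark that compares ORC read performance against CSV and Parquet. Use test files generated with Spark or Hive. Measure both raw read bandwidth and Dask parallelism.
Tech stack: Python
Area: performance
Issue type: performance
Difficulty: 2
Estimated time: 1-3 hours
Activity: active
Clarity: clear
Prerequisites: Python, Git
Beginner-friendliness: 65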
