`athena.to_parquet` fails when `mode=overwrite_partitions` and `partition_cols` contains something like `hour(timestamp_col)`. · aws/aws-sdk-pandas#2845

Repository metrics

Stars: 3,560
PR merge metrics: average time to merge 6d 23h; 37 PRs merged in the last 30 days

Description

Describe the bug

When using `s3.to_parquet` to update a parquet file that is partitioned by a time interval or a timestamp "attribute" (such as year, month, or hour), the function fails. With `mode=overwrite_partitions`, the implementation assumes that the values of `partition_cols` are names of the parquet / table columns, so it does not find an expression like `hour(column)` among the DataFrame columns.

I think the problem is this line, which calls `delete_from_iceberg_table`, a function that expects column names.
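
To illustrate the failure described above, here is a minimal sketch in plain Python. It does not reproduce the library's code: `missing_partition_cols` is a hypothetical helper and the column names are made up. It only shows why a membership test against column names rejects a transform expression.

```python
from typing import Iterable, List


def missing_partition_cols(df_columns: Iterable[str], partition_cols: Iterable[str]) -> List[str]:
    """Return the partition_cols entries that are not literal column names.

    Hypothetical stand-in for a check that treats every entry of
    partition_cols as a column name.
    """
    available = set(df_columns)
    return [col for col in partition_cols if col not in available]


# Made-up column names, for illustration only.
columns = ["id", "timestamp_col", "value"]

# A plain column name is found, so nothing is reported as missing.
print(missing_partition_cols(columns, ["timestamp_col"]))        # []

# A transform expression is not a column name, so the lookup fails.
print(missing_partition_cols(columns, ["hour(timestamp_col)"]))  # ['hour(timestamp_col)']
```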

How to Reproduce

Expected behavior

I expect the `partition_cols` option to accept anything that can be used to partition a parquet, in particular anything that is accepted when the `mode` argument is `append` or `overwrite` instead of `overwrite_partitions`.

Your project

No response

Screenshots

No response

OS

Ubuntu 22.04

Python version

3.10

AWS SDK for pandas version

3.7.3

Additional context

No response

Contributor guide

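Investigation approach: Examine the `delete_from_iceberg_table` function near line 452 of `awswrangler/athena/_write_iceberg.py` to understand how it handles `partition_cols`. Determine whether the problem can be solved either by parsing expressions such as `hour(column)` to extract the column name so that the delete can proceed, or by changing the logic to handle transformed partition columns.
Tech stack: Python
Area: backend
Issue type: Bug
Difficulty: 2
Estimated time: 1-3 hours
Activity: Active
Clarity: Mostly clear
Prerequisites: Python
Beginner friendliness: 70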
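
The first option in the investigation approach, extracting the underlying column name from a partition expression, could look roughly like the sketch below. `source_column` and `source_columns` are hypothetical helpers, not part of awswrangler, and the sketch assumes that the column is the last argument of a transform (as in `hour(ts)` or `bucket(16, id)`).

```python
import re
from typing import Iterable, List

# Matches a function-style expression such as hour(ts) or bucket(16, id).
_TRANSFORM_RE = re.compile(r"^\s*(?P<func>[A-Za-z_]\w*)\s*\((?P<args>.*)\)\s*$")


def source_column(partition_expr: str) -> str:
    """Return the column that a partition expression refers to.

    Plain column names are returned unchanged (stripped of whitespace).
    For a transform, the last argument is assumed to be the column.
    """
    match = _TRANSFORM_RE.match(partition_expr)
    if match is None:
        return partition_expr.strip()
    args = [arg.strip() for arg in match.group("args").split(",")]
    return args[-1]


def source_columns(partition_cols: Iterable[str]) -> List[str]:
    """Map every entry of partition_cols to its underlying column name."""
    return [source_column(expr) for expr in partition_cols]


print(source_columns(["region", "hour(timestamp_col)", "bucket(16, id)"]))
# ['region', 'timestamp_col', 'id']
```

Recovering the column name alone may not be enough. A delete keyed on the raw column would presumably match exact timestamp values rather than whole hour partitions, so the second option (handling the transformed partition values) may still be needed.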
