aws/aws-sdk-pandas

`athena.to_parquet` fails when `mode=overwrite_partitions` and `partition_cols` contains something like `hour(timestamp_col)`.

Open

#2,845 opened on 2024年6月4日

GitHub で見る
 (3 comments) (0 reactions) (0 assignees)Python (3,560 stars) (630 forks)batch import
backlogbughelp wanted

説明

Describe the bug

When using s3.to_parquet to update a parquet file that is partitioned by a time interval or a timestamp "attribute" (such as year, month, hour, etc.), the function fails because for this mode the implementation assumes that the values of partition_cols are names of the parquet / table columns, and it does not find something like hour(column) in the dataframe columns.

I think the problem is this line, which uses the function delete_from_iceberg_table, which expects column names.

How to Reproduce

Expected behavior

I expect the partition_cols option to accept anything that can be used to partition a parquet. In particular, anything that is accepted when the argument mode is append or overwrite instead of overwrite_partitions.

Your project

No response

Screenshots

No response

OS

Ubuntu 22.04

Python version

3.10

AWS SDK for pandas version

3.7.3

Additional context

No response

コントリビューターガイド