aws/aws-sdk-pandas

`athena.to_parquet` fails when `mode=overwrite_partitions` and `partition_cols` contains something like `hour(timestamp_col)`.

Open

#2845 opened on June 4, 2024

Labels: backlog, bug, help wanted

Description

Describe the bug

When using `s3.to_parquet` to update a Parquet dataset that is partitioned by a time interval or a timestamp "attribute" (such as year, month, or hour), the function fails. In `overwrite_partitions` mode, the implementation assumes that every value in `partition_cols` is the name of a Parquet/table column, so it does not find an expression like `hour(column)` among the DataFrame columns.

I think the problem is this line, which calls `delete_from_iceberg_table`, a function that expects plain column names.
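
To illustrate the suspected failure mode, here is a minimal sketch of a naive column-name check. It is not the library's code: `missing_partition_cols` and the sample column names are hypothetical, and the sketch only assumes that each `partition_cols` entry is looked up verbatim among the DataFrame columns.

```python
def missing_partition_cols(df_columns, partition_cols):
    """Return the partition_cols entries that are not literal column names.

    Hypothetical helper: mimics a check that treats every entry as a column name.
    """
    known = set(df_columns)
    return [col for col in partition_cols if col not in known]


# Hypothetical DataFrame columns used only for illustration.
df_columns = ["id", "timestamp_col", "value"]

# A plain column name passes the check.
plain = missing_partition_cols(df_columns, ["timestamp_col"])

# A partition transform is not a column name, so the same check rejects it.
transformed = missing_partition_cols(df_columns, ["hour(timestamp_col)"])
```

Under that assumption, `plain` is empty while `transformed` still contains `hour(timestamp_col)`, which matches the behavior described above.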

How to Reproduce

Expected behavior

I expect the `partition_cols` option to accept anything that can be used to partition a Parquet dataset. In particular, it should accept anything that is already accepted when `mode` is `append` or `overwrite` instead of `overwrite_partitions`.
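
One possible direction, sketched here under stated assumptions rather than as a proposed patch, is to resolve each entry to its underlying column before validating it. The helper names `source_column` and `unresolved_partition_cols` are hypothetical, and the sketch only covers the single-argument time transforms mentioned in this issue (year, month, day, hour).

```python
import re

# Matches single-argument time transforms such as hour(timestamp_col).
# Other partition transforms are deliberately out of scope for this sketch.
_TIME_TRANSFORM = re.compile(
    r"^\s*(year|month|day|hour)\(\s*([A-Za-z_][A-Za-z0-9_]*)\s*\)\s*$",
    re.IGNORECASE,
)


def source_column(partition_col):
    """Return the underlying column for a plain name or a time transform."""
    match = _TIME_TRANSFORM.match(partition_col)
    return match.group(2) if match else partition_col.strip()


def unresolved_partition_cols(df_columns, partition_cols):
    """Return entries whose underlying column is missing from df_columns."""
    known = set(df_columns)
    return [col for col in partition_cols if source_column(col) not in known]


# Hypothetical DataFrame columns used only for illustration.
df_columns = ["id", "timestamp_col", "value"]
remaining = unresolved_partition_cols(df_columns, ["id", "hour(timestamp_col)"])
```

With this kind of resolution step, `hour(timestamp_col)` validates against `timestamp_col`, while an entry that refers to a column missing from the DataFrame is still reported.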

Your project

No response

Screenshots

No response

OS

Ubuntu 22.04

Python version

3.10

AWS SDK for pandas version

3.7.3

Additional context

No response
