`athena.to_iceberg` fails when `mode=overwrite_partitions` and `partition_cols` contains something like `hour(timestamp_col)`. · aws/aws-sdk-pandas#2845

4 comments · 0 reactions · 0 assignees · Python · 630 forks

Labels: backlog, bug, help wanted

Repository metrics

Stars: 3,560
PR merge metrics: average time to merge 6 days 23 hours; 37 PRs merged in the last 30 days

Description

Describe the bug

When using athena.to_iceberg with mode=overwrite_partitions to update an Iceberg table that is partitioned by a time interval or a timestamp "attribute" (such as year, month, hour, etc.), the function fails. In this mode the implementation assumes that the values of partition_cols are names of table columns, so it does not find something like hour(column) among the dataframe columns.

I think the problem is this line, which uses the function delete_from_iceberg_table, which expects column names.
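
To make the mismatch concrete, here is a minimal sketch in plain Python. It is not the library's code: the column names and the helper `missing_partition_columns` are hypothetical, and it only assumes that the overwrite_partitions path looks up each partition_cols entry as a literal column name.

```python
# Hypothetical illustration of the failure mode; not awswrangler's actual code.
# Assumption: each partition_cols entry is looked up as a literal column name.

df_columns = ["event_id", "timestamp_col", "value"]  # columns of the dataframe
partition_cols = ["hour(timestamp_col)"]             # partition transform, not a column


def missing_partition_columns(partition_cols, df_columns):
    """Return the partition entries that are not literal dataframe columns."""
    return [col for col in partition_cols if col not in df_columns]


missing = missing_partition_columns(partition_cols, df_columns)
print(missing)  # the transform expression is reported as a missing column
```

The transform expression wraps a real column, but a lookup by literal name never sees that column.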

How to Reproduce

Expected behavior

I expect the partition_cols option to accept anything that can be used to partition an Iceberg table. In particular, it should accept anything that is accepted when the mode argument is append or overwrite instead of overwrite_partitions.

Your project

No response

Screenshots

No response

OS

Ubuntu 22.04

Python version

3.10

AWS SDK for pandas version

3.7.3

Additional context

No response

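Contributor guide

Research direction: Inspect the `delete_from_iceberg_table` function around line 452 of `awswrangler/athena/_write_iceberg.py` to see how it handles `partition_cols`. Determine whether the issue can be solved by parsing expressions such as `hour(column)` to extract the column name used for deletion, or by changing the logic so that it handles transformed partition columns.
Tech stack: python
Domain: backend
Issue type: bug
Difficulty: 2
Estimated time: 1-3 hours
Activity: active
Clarity: mostly clear
Prerequisites: Python
Beginner friendliness: 70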
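
The first option in the research direction, extracting the column name from a partition expression, could look roughly like the sketch below. The helpers `source_column` and `source_columns` are hypothetical and not part of awswrangler. The sketch assumes transforms are written as `name(column)` or `name(N, column)` (for example `bucket(16, user_id)`), and that plain column names should pass through unchanged.

```python
import re

# Assumption: a partition transform looks like name(column) or name(N, column).
# Anything that does not match is treated as a plain column name.
_TRANSFORM_RE = re.compile(
    r"^\s*(?P<func>\w+)\s*\(\s*(?:(?P<arg>\d+)\s*,\s*)?(?P<col>\w+)\s*\)\s*$"
)


def source_column(partition_expr: str) -> str:
    """Return the column a partition expression refers to (hypothetical helper)."""
    match = _TRANSFORM_RE.match(partition_expr)
    if match is None:
        return partition_expr.strip()  # already a plain column name
    return match.group("col")


def source_columns(partition_cols):
    """Map partition expressions to unique source columns, preserving order."""
    seen = []
    for expr in partition_cols:
        col = source_column(expr)
        if col not in seen:
            seen.append(col)
    return seen


print(source_columns(["hour(timestamp_col)", "region"]))
```

Extracting the column name only addresses the lookup. Whether deleting by the raw column values matches the partitions being overwritten depends on how `delete_from_iceberg_table` builds its predicate, which would still need to be checked against the source.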
