The correct way to delete a Delta Lake partition on AWS s3

Published 2025-01-11 20:16:36


I need to delete a Delta Lake partition, together with its associated AWS s3 files, and then make sure AWS Athena reflects the change. I need to do this because I have to rerun some code to re-populate the data.

I tried this:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, path)
deltaTable.delete("extract_date = '2022-03-01'")  # extract_date is the partition column

It completed with no errors, but the files on s3 still exist, and Athena still shows the data even after running MSCK REPAIR TABLE after the delete. Can someone advise the best way to delete partitions and update Athena?


Comments (3)

允世 2025-01-18 20:16:36


Although you performed the delete operation, the data is still there because Delta tables keep history, and actual deletion of the data files happens only when you execute the VACUUM operation and the files are older than the default retention period (7 days). If you want to remove the data faster, you can run the VACUUM command with the RETAIN XXX HOURS parameter, but this may require setting some additional properties to enforce it. Refer to the documentation for more details.
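A minimal sketch of the full delete-then-vacuum sequence, assuming a PySpark session with the delta-spark package and that path points at the Delta table root:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, path)

# Logical delete: writes a new table version; the underlying parquet
# files remain on s3 as part of the table history.
deltaTable.delete("extract_date = '2022-03-01'")

# Physical delete: removes files that are no longer referenced by the
# current version and are older than the retention threshold.
# 168 hours is the 7-day default; going lower requires an extra
# property (see the next answer).
spark.sql(f"VACUUM delta.`{path}` RETAIN 168 HOURS")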

错々过的事 2025-01-18 20:16:36


To add to Alex's answer: if you want to shorten the retention period to less than 7 days, you have to change the configuration property spark.databricks.delta.retentionDurationCheck.enabled to false.

From the original docs:

Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false.
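A short sketch of that, assuming no other job reads or writes the table while the shorter retention is in effect (the 0-hour value and the spark.conf.set call here are illustrative, not the only way to set the property):

# Disable the safety check so VACUUM accepts a retention below 7 days.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# 0 hours immediately removes every file not referenced by the
# current table version.
spark.sql(f"VACUUM delta.`{path}` RETAIN 0 HOURS")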

允世 2025-01-18 20:16:36


From my observation, I can say that VACUUM doesn't delete the s3 files. I used VACUUM with the default retention (7 days), and I still see the parquet files on s3 even after 7 days have elapsed since the command was run.
