Delta Lake: replaceWhere not working with a date-formatted partition column
My use case is that I want to partition my table by date. On different dates, rows will be appended, but if the code is rerun on the same date, that date's data should be overwritten.
After looking online, it seemed like this task could be done using Delta Lake's replaceWhere feature, but I am fine with any solution that involves Parquet.
I have the following code:
from datetime import date

from pyspark.sql import SparkSession
from pyspark.sql.types import DateType, StringType, StructField, StructType

# Create (or reuse) the active SparkSession.
spark = SparkSession.builder.getOrCreate()

data = [(date(2022, 6, 19), "Hello"), (date(2022, 6, 19), "World")]
schema = StructType([StructField("date", DateType()), StructField("message", StringType())])
df = spark.createDataFrame(data, schema=schema)

df.write.partitionBy("date").option("replaceWhere", "date = '2022-06-19'").save("/tmp/test", mode="overwrite", format="delta")
df.write.partitionBy("date").option("replaceWhere", "date = '2022-06-19'").save("/tmp/test_3", mode="overwrite", format="delta")
At the second write call, the code throws the following exception:
pyspark.sql.utils.AnalysisException: Data written out does not match replaceWhere 'date = '2022-06-19''.
CHECK constraint EXPRESSION(('date = 2022-06-19)) (date = '2022-06-19') violated by row with values:
- date : 17337
Comments (1)
This issue generally occurs because the partition column does not hold the same value as the partition you want to replace.
Here, your issue might be that your partition column is in date format; maybe, if you try it with a string instead, it should work fine.
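A minimal sketch of that suggestion, assuming the same data as in the question (the /tmp/test_str path is hypothetical, not from the original post): convert the partition column to a 'yyyy-MM-dd' string before writing, so the value stored in the partition compares like-for-like with the string literal in the replaceWhere predicate.

from datetime import date

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format
from pyspark.sql.types import DateType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

data = [(date(2022, 6, 19), "Hello"), (date(2022, 6, 19), "World")]
schema = StructType([StructField("date", DateType()), StructField("message", StringType())])
df = spark.createDataFrame(data, schema=schema)

# Store the partition column as a 'yyyy-MM-dd' string so it matches
# the string literal in the replaceWhere predicate.
df = df.withColumn("date", date_format(col("date"), "yyyy-MM-dd"))

# Hypothetical output path; reruns for the same day replace only that partition.
df.write.partitionBy("date").option("replaceWhere", "date = '2022-06-19'").save("/tmp/test_str", mode="overwrite", format="delta")

With this setup, rerunning the write for a given day overwrites only that day's partition, while writes with a different date predicate leave the other partitions untouched, which matches the use case in the question.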