Spark Streaming with Python: how to add a UUID column?
I would like to add a column with a generated id to my data frame. I have tried:
uuidUdf = udf(lambda x: str(uuid.uuid4()), StringType())
df = df.withColumn("id", uuidUdf())
However, when I do this, nothing is written to my output directory. When I remove these lines everything works fine, so there must be some error, but I don't see anything in the console.
I have tried using monotonically_increasing_id() instead of generating a UUID but in my testing, this produces many duplicates. I need a unique identifier (does not have to be a UUID specifically).
How can I do this?
5 Answers
A simple way:
Please try this:
Note: after adding the new column, you should assign the result to a new DataFrame (df1 = df.withColumn(...)).
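The code this answer referred to is missing from the scrape; a sketch of what it likely looked like, using a uuid4 UDF and assigning to a new DataFrame as the note says (the variable names are assumptions, and df is the question's DataFrame):

```python
import uuid
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# The UDF is called with no arguments, so the lambda takes none either;
# the asker's version (lambda x: ...) fails when invoked as uuidUdf().
uuid_udf = udf(lambda: str(uuid.uuid4()), StringType())

# withColumn does not mutate df; assign the result to a new DataFrame.
df1 = df.withColumn("id", uuid_udf())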
From pyspark's functions.py:
So for a UUID this would be:
and the usage:
Please use the lit function so that you generate the same id for all the records. lit evaluates the expression only once, gets the column value, and adds it to every record. Using a udf won't do that, as it gets called for every row, and we end up getting new UUIDs for each call.
I'm using pyspark == 3.2.1; you can add your UUID version easily like the following.
Update: It seems that I ended up using a UDF function.