Counting unique elements in an array column
I have a dataset with a column of array type. From this column, I need to create another column that holds the list of unique elements together with their counts.
Example: [a, b, e, b]
The result should be [[b, a, e], [2, 1, 1]]. The data should be sorted by count. A key-value structure where the value is the count would also work. I created a UDF for this purpose (see below), but it is very slow, so I need to do this with PySpark built-in functions.
id | col_a | collected_col_a |
---|---|---|
1 | a | [a, b, e, b] |
1 | b | [a, b, e, b] |
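import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

struct_schema1 = StructType([
    StructField('elements', ArrayType(StringType()), nullable=True),
    StructField('count', ArrayType(IntegerType()), nullable=True)
])

# udf
@udf(returnType=struct_schema1)
def func1(x, top=10):
    # Unique elements and how often each occurs
    y, z = np.unique(x, return_counts=True)
    z_y = zip(z.tolist(), y.tolist())
    # Elements ordered by count, highest first (ties broken by element, descending)
    y = [i for _, i in sorted(z_y, reverse=True)]
    z = sorted(z.tolist(), reverse=True)
    if len(y) > top:
        return {'elements': y[:top], 'count': z[:top]}
    else:
        return {'elements': y, 'count': z}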
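For checking any built-in rewrite against the UDF, a plain-Python reference of the same logic can help. The sketch below is an illustration only: `unique_counts` is a hypothetical name, and `collections.Counter` stands in for `np.unique`. It keeps the UDF's ordering, where ties on count are broken by element in descending order, so the tie order can differ from the [b, a, e] shown in the example.

```python
from collections import Counter


def unique_counts(values, top=10):
    """Plain-Python reference: unique elements and counts, highest count first."""
    counts = Counter(values)
    # Sort by (count, element) descending, which mirrors
    # sorted(zip(counts, elements), reverse=True) in the UDF.
    ranked = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)[:top]
    return {
        "elements": [element for element, _ in ranked],
        "count": [count for _, count in ranked],
    }


print(unique_counts(["a", "b", "e", "b"]))
```

Because it needs no Spark session, this function can be used to generate expected values for small test arrays.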
Comments (2)
You can use a combination of the transform and filter functions along with array_distinct and size to get the desired output. Here's an example:
An approach creating a map, using aggregate and map_zip_with. The other approach seems clearer, though. Sorry, I couldn't figure out how to sort based on map values.