Filter all rows based on the existence of an element in an array column

Posted on 2025-01-25 08:33:33

I have a spark dataframe similar to this:

...	Rules	...
...	[{"id": 1,...}, {"id": 2}]	...
...	[{"id": 5,...}]	...

Rules is an array column, where each element has the id field.

I want to keep all rows that contain a rule with id < 3. Is it possible to do this without UDFs? My dataframe is very large, and UDFs hurt the performance of my query.

仅此而已 2025-02-01 08:33:33

You can use EXISTS, one of Spark SQL's higher-order functions (available since Spark 2.4):

# Given dataset
+----------------------+
|Rules                 |
+----------------------+
|[{id -> 1}]           |
|[{id -> 1}, {id -> 2}]|
|[{id -> 5}]           |
+----------------------+

import pyspark.sql.functions as f

df_filtered_rules = df.where(f.expr("EXISTS(Rules, rule -> rule.id < 3)"))
df_filtered_rules.show(truncate=False)

+----------------------+
|Rules                 |
+----------------------+
|[{id -> 1}]           |
|[{id -> 1}, {id -> 2}]|
+----------------------+


简单 2025-02-01 08:33:33

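Extract the map values from each array element, take the max or the min of the resulting array (depending on what you want), and check whether it is less than 3. That returns a boolean, which you can then use in a where clause. Note that map_values(x)[0] takes the first value of each map, so the expressions below assume id is the first entry of every rule.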

+----------------------+
|Rules                 |
+----------------------+
|[{id -> 1}, {id -> 2}]|
|[{id -> 5}]           |
+----------------------+

If you want to keep only those rows where every id is less than 3, use:

from pyspark.sql.functions import expr

df.where(expr("array_max(transform(Rules, x -> map_values(x)[0])) < 3")).show(truncate=False)

If you want to keep rows where any id is less than 3, use:

df.where(expr("array_min(transform(Rules, x -> map_values(x)[0])) < 3")).show(truncate=False)

+----------------------+
|Rules                 |
+----------------------+
|[{id -> 1}, {id -> 2}]|
+----------------------+
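
Why array_min answers the "any id" question and array_max the "every id" question can be checked outside Spark. The sketch below is plain Python, not Spark code: each row is modelled as a non-empty list of ids, and the helper names any_id_below and every_id_below are hypothetical, introduced only for this illustration.

```python
def any_id_below(ids, limit=3):
    """Plain-Python model of EXISTS(Rules, rule -> rule.id < limit)."""
    return any(i < limit for i in ids)


def every_id_below(ids, limit=3):
    """Plain-Python model of 'all ids in the row are below limit'."""
    return all(i < limit for i in ids)


# Sample rows (assumed data); [2, 5] is the case where the two checks disagree.
rows = [[1], [1, 2], [2, 5], [5]]
for ids in rows:
    # The smallest id is below the limit exactly when some id is below it.
    assert (min(ids) < 3) == any_id_below(ids)
    # The largest id is below the limit exactly when every id is below it.
    assert (max(ids) < 3) == every_id_below(ids)
    print(ids, any_id_below(ids), every_id_below(ids))
```

For the original question (rows containing at least one rule with id < 3), the array_min form is the one that matches the EXISTS filter.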
