Filter out values that are in a list using expr and filter

Posted on 2025-02-09 07:05:41

I want to filter the values of an array column in a dataframe, dropping every value that is part of a list.

I am aware that I can use udf to go about this and it works.

// `val` is a reserved word in Scala, so the parameter is named `vals`.
def filterNegatives(vals: Seq[String]): Seq[String] = {
    vals.filter(v => !badList.contains(v))
}
val filterNegativesUdf = udf(filterNegatives _, ArrayType(StringType))

val cleanedDF = myDF.withColumn("pos" , filterNegativesUdf(col("allVals")))

I was wondering whether there is a non-UDF way of achieving this.

I have tried the following and it works.

val cleanedDF = myDF.withColumn("pos", expr(s"filter(allVals, val -> val NOT IN ('badval1', 'badval2'))"))

However, my list badList contains ~10 elements, and I'd rather keep the code clean by defining the list once.

I have tried using the list inside filter in different variations, but all of them had some errors.

.withColumn("pos", expr(s"filter(allVals, val NOT IN ${badList}"))

//error:no viable alternative at input 'NOT IN List'     

Using Scala version 2.11.

Comments (1)

浅语花开 2025-02-16 07:05:41

Consider using array_contains within your higher-order function filter as shown below.

val df = Seq(
  (1, Array("a", "b", "c", "d", "e")),
  (2, Array("h", "i", "j", "k")),
  (3, Array("u", "u", "v", "v", "w", "w"))
).toDF("id", "values")

val badList = Array("a", "e", "i", "o", "u")

On Spark 3.x:

df.
  withColumn("pos", filter($"values", v => !array_contains(lit(badList), v))).
  show
/*
+---+------------------+------------+
| id|            values|         pos|
+---+------------------+------------+
|  1|   [a, b, c, d, e]|   [b, c, d]|
|  2|      [h, i, j, k]|   [h, j, k]|
|  3|[u, u, v, v, w, w]|[v, v, w, w]|
+---+------------------+------------+
*/

On Spark 2.4:

df.
  withColumn("bad_list", lit(badList)).
  withColumn("pos", expr("filter(values, v -> !array_contains(bad_list, v))")).
  drop("bad_list").
  show
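
If you would rather keep the NOT IN form from the question, another option is to render the list into the SQL string yourself. The attempt in the question fails because ${badList} interpolates the collection's toString, something like List(badval1, badval2), which is not valid SQL, and the lambda arrow was also missing. Below is a minimal sketch of that idea. SqlListSketch, quote and notInFilterExpr are hypothetical names made up for illustration, and the helper assumes the values contain no single quotes or backslashes (it rejects them instead of trying to escape them).

```scala
// Hypothetical helper (not from the original post): renders a Scala
// collection into a Spark SQL higher-order `filter` expression string.
object SqlListSketch {

  // Wrap one value in single quotes as a SQL string literal.
  // Assumption: values contain no single quotes or backslashes;
  // such values are rejected rather than escaped.
  private def quote(value: String): String = {
    require(
      !value.exists(c => c == '\'' || c == '\\'),
      s"unsupported character in value: $value"
    )
    s"'$value'"
  }

  // Builds: filter(<column>, v -> v NOT IN ('a', 'b', ...))
  def notInFilterExpr(column: String, badValues: Seq[String]): String = {
    // Guard against an empty list, which would render `NOT IN ()`.
    require(badValues.nonEmpty, "badValues must not be empty")
    val inList = badValues.map(quote).mkString(", ")
    s"filter($column, v -> v NOT IN ($inList))"
  }
}
```

With Spark in scope, the resulting string would then be passed to expr, for example myDF.withColumn("pos", expr(SqlListSketch.notInFilterExpr("allVals", badList))). The array_contains variants above avoid string building altogether, so they are probably the safer choice when the values are not under your control.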

Note that you could also consider using function array_except, but the catch is that any duplicates in the original array will be removed:

df.
  withColumn("pos", array_except($"values", lit(badList))).
  show
/*
+---+------------------+---------+
| id|            values|      pos|
+---+------------------+---------+
|  1|   [a, b, c, d, e]|[b, c, d]|
|  2|      [h, i, j, k]|[h, j, k]|
|  3|[u, u, v, v, w, w]|   [v, w]|
+---+------------------+---------+
*/
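
To make the difference between the two approaches concrete, here is a rough analogy using plain Scala collections rather than Spark. ExceptVsFilterSketch, likeFilter and likeArrayExcept are hypothetical names used only for this illustration; they mimic the observable results shown above and are not how Spark implements either function.

```scala
// Plain-Scala analogy (no Spark involved) for the two behaviours above.
object ExceptVsFilterSketch {
  val badList: Seq[String] = Seq("a", "e", "i", "o", "u")

  // Mimics filter(values, v -> !array_contains(bad_list, v)):
  // every surviving element is kept, duplicates included.
  def likeFilter(values: Seq[String]): Seq[String] =
    values.filterNot(v => badList.contains(v))

  // Mimics array_except(values, bad_list):
  // surviving elements are de-duplicated as well.
  def likeArrayExcept(values: Seq[String]): Seq[String] =
    values.distinct.filterNot(v => badList.contains(v))
}
```

For the third row, Seq("u", "u", "v", "v", "w", "w"), likeFilter keeps both copies of v and w, while likeArrayExcept keeps one of each.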