Invalid argument, not a string or column, when using a UDF
I have a data quality class that performs checks on a DataFrame. I use methods defined in this class to run these checks (they always return 3-tuples). These methods are called by a UDF that I want to invoke from another DataFrame:
import pyspark.sql.functions as F
from pyspark.sql.types import StructType

@F.udf(StructType())
def dq_check_wrapper(df, col, _test):
    if _test == 'is_null':
        return Valid_df(df).is_not_null(col).execute()
    elif _test == 'unique':
        return Valid_df(df).is_unique(col).execute()
Say I want to assess the DQ on this df:
df = spark.createDataFrame(
[
(None, 128.0, 1),(110, 127.0, 2),(111, 127.0, 3),(111, 127.0, 4)
,(111, 126.0, 5),(111, 127.0, 6),(109, 126.0, 7),(111, 126.0, 1001)
,(114, 126.0, 1003),(115, 83.0, 1064),(116, 127.0, 1066)
], ['HR', 'maxABP', 'Second']
)
To make it dynamic, I want to use a metadata df:
metadata = sqlContext.sql("select 'HR' as col, 'is_null' as dq_check")
+---+--------+
|col|dq_check|
+---+--------+
| HR| is_null|
+---+--------+
But then, when I try:
metadata\
.withColumn("valid_dq", dq_check_wrapper(df, metadata.col, metadata.dq_check))\
.show()
I get a TypeError:
TypeError: Invalid argument, not a string or column: DataFrame[HR: bigint, maxABP: double, Second: bigint] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
Why?
Comments (1)
Because if I don't declare that df is of type DataFrame, it gets inferred as a string.
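In other words, a PySpark UDF only accepts Column objects (or strings naming columns) as arguments; a whole DataFrame cannot be passed in, which is why the column-conversion step raises the TypeError above. Below is a minimal driver-side sketch of one way around this, assuming the Valid_df class behaves exactly as in the question (is_not_null, is_unique, execute); run_dq_checks is just an illustrative name, not part of the original code: collect the small metadata DataFrame and run each check directly on df instead of wrapping the checks in a UDF.

# Hypothetical driver-side alternative (sketch only; assumes Valid_df works as described above):
# iterate over the metadata rows on the driver and call the check methods directly,
# instead of trying to send the whole DataFrame into a UDF.
def run_dq_checks(df, metadata):
    results = []
    for row in metadata.collect():  # metadata is tiny, so collect() is cheap
        if row['dq_check'] == 'is_null':
            outcome = Valid_df(df).is_not_null(row['col']).execute()
        elif row['dq_check'] == 'unique':
            outcome = Valid_df(df).is_unique(row['col']).execute()
        else:
            outcome = None  # unknown check name
        results.append((row['col'], row['dq_check'], outcome))
    return results

# e.g. run_dq_checks(df, metadata) -> [('HR', 'is_null', <3-tuple returned by execute()>)]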