UDF: TypeError: 'int' object is not iterable

Posted on 2025-02-06 11:51:09 · 2,634 characters · 2 views · 0 comments

I'm converting Scala code to Python. The Scala code uses a UDF.

def getVectors(searchTermsToProcessWithTokens: Dataset[Person]): Dataset[Person] = {

    import searchTermsToProcessWithTokens.sparkSession.implicits._

    def addVectors(
      tokensToSearchFor: String,
      tokensToSearchIn: String
    ): Seq[Int] = {
      tokensToSearchFor.map(token => if (tokensToSearchIn.contains(token)) 1 else 0)
    }

    val addVectorsUdf: UserDefinedFunction = udf(addVectors _)

    searchTermsToProcessWithTokens
      .withColumn("search_term_vector", addVectorsUdf($"name", $"age"))
      .withColumn("keyword_text_vector", addVectorsUdf($"name", $"age"))
      .as[Person]
     }

My Python conversion:

def getVectors(searchTermsToProcessWithTokens): 

    def addVectors(tokensToSearchFor: str, tokensToSearchIn: str): 
      
      tokensToSearchFor = [1 if (token in tokensToSearchIn) else 0 for token in tokensToSearchIn]
      
      return tokensToSearchFor 
      
    addVectorsUdf= udf(addVectors, ArrayType(StringType()))

    TokenizedSearchTerm = searchTermsToProcessWithTokens \
      .withColumn("search_term_vector", addVectorsUdf(col("name"), col("age"))) \
      .withColumn("keyword_text_vector", addVectorsUdf(col("name"), col("age")))
      
      
    return TokenizedSearchTerm

Defining a simple Dataset in Scala like

case class Person(name: String, age: Int)

val personDS = Seq(Person("Max", 33), Person("Adam", 32), Person("Muller", 62)).toDS()
personDS.show()
// +------+---+
// |  name|age|
// +------+---+
// |   Max| 33|
// |  Adam| 32|
// |Muller| 62|
// +------+---+

I'm getting this output from the Scala function

val x= getVectors(personDS)

x.show()
// +------+---+------------------+-------------------+
// |  name|age|search_term_vector|keyword_text_vector|
// +------+---+------------------+-------------------+
// |   Max| 33|         [0, 0, 0]|          [0, 0, 0]|
// |  Adam| 32|      [0, 0, 0, 0]|       [0, 0, 0, 0]|
// |Muller| 62|[0, 0, 0, 0, 0, 0]| [0, 0, 0, 0, 0, 0]|
// +------+---+------------------+-------------------+

But for the same DataFrame defined in PySpark,

%python
personDF = spark.createDataFrame([["Max", 32], ["Adam", 33], ["Muller", 62]], ['name', 'age'])

+------+---+
|  name|age|
+------+---+
|   Max| 32|
|  Adam| 33|
|Muller| 62|
+------+---+

I'm getting from the Python version

An exception was thrown from the UDF: 'TypeError: 'int' object is not iterable'

What is wrong with this conversion?



Comments (1)

只为守护你 2025-02-13 11:51:09

It's because your tokensToSearchIn is an integer, while it should be a string. The following works:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, IntegerType

def getVectors(searchTermsToProcessWithTokens):
    def addVectors(tokensToSearchFor: str, tokensToSearchIn: str):
        # Cast the integer age to str so `in` works, and iterate over
        # tokensToSearchFor (the original code iterated over tokensToSearchIn)
        return [1 if token in str(tokensToSearchIn) else 0 for token in tokensToSearchFor]

    # The UDF returns a list of ints, matching the Scala Seq[Int],
    # so the return type should be ArrayType(IntegerType()), not StringType
    addVectorsUdf = udf(addVectors, ArrayType(IntegerType()))

    TokenizedSearchTerm = searchTermsToProcessWithTokens \
      .withColumn("search_term_vector", addVectorsUdf(col("name"), col("age"))) \
      .withColumn("keyword_text_vector", addVectorsUdf(col("name"), col("age")))

    return TokenizedSearchTerm
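The original error is easy to reproduce in plain Python: the buggy comprehension iterated over tokensToSearchIn, which receives the integer value of the age column, and integers are not iterable:

```python
age = 33  # the value the UDF receives from the "age" column

try:
    # This is effectively what the original comprehension did
    [1 for token in age]
except TypeError as e:
    print(e)  # 'int' object is not iterable
```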

Out of curiosity: you don't actually need a UDF for this. But it doesn't look much simpler...

from pyspark.sql import functions as F

def getVectors(searchTermsToProcessWithTokens):
    def addVectors(tokensToSearchFor: str, tokensToSearchIn: str):
        def arr(s: str):
            # Split the (stringified) column into an array of single characters
            return F.split(F.col(s).cast("string"), '(?!$)')
        # transform/array_contains are higher-order functions (Python API: Spark 3.1+)
        return F.transform(
            arr(tokensToSearchFor),
            lambda token: F.when(F.array_contains(arr(tokensToSearchIn), token), 1).otherwise(0))

    TokenizedSearchTerm = searchTermsToProcessWithTokens \
      .withColumn("search_term_vector", addVectors("name", "age")) \
      .withColumn("keyword_text_vector", addVectors("name", "age"))

    return TokenizedSearchTerm
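For reference, the per-character membership logic both versions compute can be sanity-checked in plain Python, no Spark needed (the helper name below is illustrative, not part of either answer):

```python
def add_vectors(tokens_to_search_for: str, tokens_to_search_in: str):
    # 1 where the character also occurs in the other string, else 0
    return [1 if token in tokens_to_search_in else 0 for token in tokens_to_search_for]

# Reproduces the rows of the Scala output above:
print(add_vectors("Max", "33"))     # [0, 0, 0]
print(add_vectors("Muller", "62"))  # [0, 0, 0, 0, 0, 0]
```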
