UDF: 'TypeError: 'int' object is not iterable'
I'm converting Scala code to Python. The Scala code uses a UDF.
def getVectors(searchTermsToProcessWithTokens: Dataset[Person]): Dataset[Person] = {
  import searchTermsToProcessWithTokens.sparkSession.implicits._
  def addVectors(
    tokensToSearchFor: String,
    tokensToSearchIn: String
  ): Seq[Int] = {
    tokensToSearchFor.map(token => if (tokensToSearchIn.contains(token)) 1 else 0)
  }
  val addVectorsUdf: UserDefinedFunction = udf(addVectors _)
  searchTermsToProcessWithTokens
    .withColumn("search_term_vector", addVectorsUdf($"name", $"age"))
    .withColumn("keyword_text_vector", addVectorsUdf($"name", $"age"))
    .as[Person]
}
My Python conversion:
def getVectors(searchTermsToProcessWithTokens):
    def addVectors(tokensToSearchFor: str, tokensToSearchIn: str):
        tokensToSearchFor = [1 if (token in tokensToSearchIn) else 0 for token in tokensToSearchIn]
        return tokensToSearchFor
    addVectorsUdf = udf(addVectors, ArrayType(StringType()))
    TokenizedSearchTerm = searchTermsToProcessWithTokens \
        .withColumn("search_term_vector", addVectorsUdf(col("name"), col("age"))) \
        .withColumn("keyword_text_vector", addVectorsUdf(col("name"), col("age")))
    return TokenizedSearchTerm
Defining a simple Dataset in Scala:
case class Person(name: String, age: Int)
val personDS = Seq(Person("Max", 33), Person("Adam", 32), Person("Muller", 62)).toDS()
personDS.show()
// +------+---+
// |  name|age|
// +------+---+
// |   Max| 33|
// |  Adam| 32|
// |Muller| 62|
// +------+---+
I get this output from the Scala function:
val x = getVectors(personDS)
x.show()
// +------+---+------------------+-------------------+
// |  name|age|search_term_vector|keyword_text_vector|
// +------+---+------------------+-------------------+
// |   Max| 33|         [0, 0, 0]|          [0, 0, 0]|
// |  Adam| 32|      [0, 0, 0, 0]|       [0, 0, 0, 0]|
// |Muller| 62|[0, 0, 0, 0, 0, 0]| [0, 0, 0, 0, 0, 0]|
// +------+---+------------------+-------------------+
But for the equivalent PySpark DataFrame:
%python
personDF = spark.createDataFrame([["Max", 32], ["Adam", 33], ["Muller", 62]], ['name', 'age'])
+------+---+
|  name|age|
+------+---+
|   Max| 32|
|  Adam| 33|
|Muller| 62|
+------+---+
the Python version gives me:
An exception was thrown from a UDF: 'TypeError: 'int' object is not iterable'
What is wrong with this conversion?
Comments (1)
It's because your tokensToSearchIn is an integer, while it should be a string. The following works:
For the sake of curiosity, you don't need a UDF. But it doesn't look much simpler...
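A minimal sketch of the UDF-based fix described above, assuming the intent is to match the Scala output: the age column is cast to a string before it reaches the UDF, which is what Spark appears to do implicitly for the typed Scala UDF. The snake_case names add_vectors and get_vectors are illustrative, not from the question. Two further changes are assumptions made to mirror the Scala code rather than something the answer states: the comprehension iterates over the first argument (the question's Python iterates over the second), and the return type is ArrayType(IntegerType()) because the function returns integers.

```python
def add_vectors(tokens_to_search_for, tokens_to_search_in):
    """For each character of the first string, emit 1 if it occurs in the second, else 0."""
    # Spark passes None for null cells; propagate it instead of failing.
    if tokens_to_search_for is None or tokens_to_search_in is None:
        return None
    # Iterate over the string being searched *for*, as the Scala version does.
    return [1 if token in tokens_to_search_in else 0 for token in tokens_to_search_for]


def get_vectors(df):
    # Imported here so that add_vectors can be tried without a Spark installation.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, IntegerType

    # The UDF returns a list of ints, so declare an integer array.
    add_vectors_udf = udf(add_vectors, ArrayType(IntegerType()))

    # Explicit cast: the Python UDF receives the raw int otherwise,
    # and iterating or searching inside an int raises TypeError.
    age_as_string = col("age").cast("string")

    return (
        df.withColumn("search_term_vector", add_vectors_udf(col("name"), age_as_string))
          .withColumn("keyword_text_vector", add_vectors_udf(col("name"), age_as_string))
    )
```

With the personDF from the question, this would be called as get_vectors(personDF).show(). The cast is the essential part; without it, the function body runs against an int no matter how the comprehension is written.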