Get top keywords with PySpark CountVectorizer
I want to extract keywords using pyspark.ml.feature.CountVectorizer. My input Spark DataFrame looks like the following:
id | text |
---|---|
1 | sun, mars, solar system, solar system, mars, solar system, venus, solar system, mars |
2 | planet, moon, milky way, milky way, moon, milky way, sun, milky way, mars, star |
I applied the following pipeline:
from pyspark.ml.feature import CountVectorizer
from pyspark.sql.functions import split

# Convert the comma-separated string to an array of terms;
# splitting on ", " keeps the terms from carrying a leading space
input_df = input_df.withColumn("text_array", split("text", ", "))

cv_text = CountVectorizer() \
    .setInputCol("text_array") \
    .setOutputCol("cv_text")

cv_model = cv_text.fit(input_df)
cv_result = cv_model.transform(input_df)
cv_result.show()
Output:
id | text | text_array | cv_text |
---|---|---|---|
1 | sun, mars, solar system, .. | [sun, mars, solar system, .. | (9,[1,2,4,7],[3.0,4.0,1.0,1.0]) |
2 | planet, moon, milky way, .. | [planet, moon, milky way, .. | (9,[0,1,3,5,6,8],[4.0,1.0,2.0,1.0,1.0,1.0]) |
How can I now get the top n keywords (the top 2, for example) for each id, i.e., for each row?
Expected output:
id | text | text_array | cv_text | keywords |
---|---|---|---|---|
1 | sun, mars, solar system, .. | [sun, mars, solar system, .. | (9,[1,2,4,7],[3.0,4.0,1.0,1.0]) | solar system, mars |
2 | planet, moon, milky way, .. | [planet, moon, milky way, .. | (9,[0,1,3,5,6,8],[4.0,1.0,2.0,1.0,1.0,1.0]) | milky way, moon |
I would be very grateful for any advice, docs, or examples!
Answer:
I haven't found a way to work with sparse vectors beyond the very few operations in the pyspark.ml.feature module, so for something like taking the top n values I would say a UDF is the way to go. The function below uses np.argpartition to find the top n entries of the vector's values and return their positions, which we can conveniently use to index into the vector's indices array. The values returned are vocabulary indices, not the actual words; if the vocabulary is not that big, we can keep it around and map each index to the actual word.
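A minimal sketch of such a UDF (the helper name extract_top_n is an illustrative choice; it assumes the fitted cv_model and the cv_result dataframe from the question's pipeline are in scope):

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# cv_model.vocabulary maps a vector index back to the original term
vocabulary = cv_model.vocabulary

def extract_top_n(n):
    def extract(vector):
        if vector is None or len(vector.values) == 0:
            return []
        k = min(n, len(vector.values))
        # np.argpartition leaves the k largest counts in the last k slots
        # (unordered); sorting that slice orders keywords by count, descending
        top = np.argpartition(vector.values, -k)[-k:]
        top = top[np.argsort(-vector.values[top])]
        # vector.indices[i] is the vocabulary index of the i-th stored count
        return [vocabulary[int(vector.indices[i])] for i in top]
    return udf(extract, ArrayType(StringType()))

cv_result = cv_result.withColumn("keywords", extract_top_n(2)("cv_text"))
cv_result.show(truncate=False)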
I'm not sure I feel that good about the solution above; it's probably not scalable.
That being said, if you don't actually need the CountVectorizer, there is a combination of standard functions we can apply to input_df to simply get the top n words of every sentence, as sketched below.
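A sketch of that approach, assuming input_df has the id and text columns from the question; note that ties in counts are broken arbitrarily by the window ordering:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

n = 2
w = Window.partitionBy("id").orderBy(F.col("count").desc())

top_words = (
    input_df
    .withColumn("word", F.explode(F.split("text", ", ")))  # one row per word
    .groupBy("id", "word")
    .count()                                                # per-id word counts
    .withColumn("rank", F.row_number().over(w))             # rank words within each id
    .filter(F.col("rank") <= n)
    .groupBy("id")
    .agg(F.collect_list("word").alias("keywords"))          # back to one row per id
)

top_words.show(truncate=False)

If you need the keywords alongside the original columns, join top_words back to input_df on id.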