Get top keywords with PySpark CountVectorizer

Posted on 2025-02-04 18:21:44


I want to extract keywords using pyspark.ml.feature.CountVectorizer.
My input Spark dataframe looks like this:

id | text
 1 | sun, mars, solar system, solar system, mars, solar system, venus, solar system, mars
 2 | planet, moon, milky way, milky way, moon, milky way, sun, milky way, mars, star

I applied the following pipeline:

from pyspark.ml.feature import CountVectorizer
from pyspark.sql.functions import split

# Convert the comma-separated string into an array of tokens
input_df = input_df.withColumn("text_array", split("text", ','))

cv_text = CountVectorizer() \
    .setInputCol("text_array") \
    .setOutputCol("cv_text")

cv_model = cv_text.fit(input_df)
cv_result = cv_model.transform(input_df)

cv_result.show()

Output:

id | text                        | text_array                    | cv_text
 1 | sun, mars, solar system, .. | [sun, mars, solar system, ..  | (9,[1,2,4,7],[3.0,4.0,1.0,1.0])
 2 | planet, moon, milky way, .. | [planet, moon, milky way, ..  | (9,[0,1,3,5,6,8],[4.0,1.0,2.0,1.0,1.0,1.0])

How can I now get the top n keywords (top 2, for example) for each id, i.e. for each row?
Expected output:

id | text                        | text_array                    | cv_text                                     | keywords
 1 | sun, mars, solar system, .. | [sun, mars, solar system, ..  | (9,[1,2,4,7],[3.0,4.0,1.0,1.0])             | solar system, mars
 2 | planet, moon, milky way, .. | [planet, moon, milky way, ..  | (9,[0,1,3,5,6,8],[4.0,1.0,2.0,1.0,1.0,1.0]) | milky way, moon

I will be very grateful for any advice, docs, examples!

Comments (1)

极致的悲 · 2025-02-11 18:21:44


I haven't found a way to work with sparse vectors beyond the few operations in the pyspark.ml.feature module, so for something like taking the top n values I would say a UDF is the way to go.

The function below uses np.argpartition to find the positions of the top n entries in the vector's values array and returns the corresponding entries of the vector's indices array, which are the vocabulary indices of those words.

import numpy as np
from pyspark.sql.functions import udf

@udf("array<integer>")
def get_top_n(v, n):
    # Positions of the n largest counts in the sparse vector's values array
    # (assumes the vector has at least n non-zero entries)
    top_n_indices = np.argpartition(v.values, -n)[-n:]
    # Map those positions back to vocabulary indices
    return [int(x) for x in v.indices[top_n_indices]]

The values returned are vocabulary indices, not the actual words. If the vocabulary is not too big, we can put it into an array column of its own and transform the indices into the actual words.

from pyspark.sql.functions import col, lit, transform

# Put the fitted vocabulary into a one-row dataframe so it can be joined in
voc = spark.createDataFrame([(cv_model.vocabulary,)], ["voc"])

cv_result \
    .withColumn("top_2", get_top_n("cv_text", lit(2))) \
    .crossJoin(voc) \
    .withColumn("top_2_parsed", transform("top_2", lambda v: col("voc")[v])) \
    .show()

+---+--------------------+--------------------+--------------------+------+--------------------+--------------------+
| id|                text|          text_array|             cv_text| top_2|                 voc|        top_2_parsed|
+---+--------------------+--------------------+--------------------+------+--------------------+--------------------+
|  1|sun, mars, solar ...|[sun,  mars,  sol...|(9,[1,2,4,7],[4.0...|[2, 1]|[ milky way,  sol...|[ mars,  solar sy...|
|  2|planet, moon, mil...|[planet,  moon,  ...|(9,[0,2,3,5,6,8],...|[3, 0]|[ milky way,  sol...| [ moon,  milky way]|
+---+--------------------+--------------------+--------------------+------+--------------------+--------------------+

I'm not sure I feel that good about the solution above; the crossJoin with the vocabulary probably won't scale. A possible workaround is sketched below.
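One alternative, just a sketch, is to capture cv_model.vocabulary in the UDF closure and return the words directly, so no crossJoin is needed. It assumes the vocabulary is small enough to be shipped to the executors with the function; the top_n_words name is only illustrative.

import numpy as np
from pyspark.sql.functions import udf, lit

# Sketch only: assumes cv_model.vocabulary fits comfortably in memory (it is
# pickled together with the UDF) and, like get_top_n above, that every vector
# has at least n non-zero entries.
vocabulary = cv_model.vocabulary

@udf("array<string>")
def top_n_words(v, n):
    # Positions of the n largest counts, mapped straight to words
    top_n_positions = np.argpartition(v.values, -n)[-n:]
    return [vocabulary[int(i)] for i in v.indices[top_n_positions]]

cv_result.withColumn("keywords", top_n_words("cv_text", lit(2))).show()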
That being said, if you don't actually need the CountVectorizer, there is a combination of standard functions we can apply to input_df to get the top n words of every row directly.

from pyspark.sql.functions import explode, row_number, desc, col
from pyspark.sql.window import Window

# Explode the tokens, count them per id, and rank them by frequency within each id
input_df \
    .select("id", explode("text_array").alias("word")) \
    .groupBy("id", "word") \
    .count() \
    .withColumn("rn", row_number().over(Window.partitionBy("id").orderBy(desc("count")))) \
    .filter(col("rn") <= 2) \
    .show()

+---+-------------+-----+---+
| id|         word|count| rn|
+---+-------------+-----+---+
|  1| solar system|    4|  1|
|  1|         mars|    3|  2|
|  2|    milky way|    4|  1|
|  2|         moon|    2|  2|
+---+-------------+-----+---+
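To get the single keywords column from the expected output in the question, the ranked words can be collected back into one row per id. This is only a sketch: the top_n and keywords_df names are illustrative, it assumes Spark 3.1+ for transform, and ties in the counts are broken arbitrarily by row_number.

from pyspark.sql.functions import (
    explode, row_number, desc, col, collect_list, struct, sort_array, transform, concat_ws
)
from pyspark.sql.window import Window

top_n = 2  # illustrative: how many keywords to keep per id

ranked = input_df \
    .select("id", explode("text_array").alias("word")) \
    .groupBy("id", "word") \
    .count() \
    .withColumn("rn", row_number().over(Window.partitionBy("id").orderBy(desc("count")))) \
    .filter(col("rn") <= top_n)

# Sort the collected (rn, word) structs so the most frequent word comes first,
# then keep only the word part and join it into a single comma-separated string.
keywords_df = ranked \
    .groupBy("id") \
    .agg(
        concat_ws(
            ", ",
            transform(sort_array(collect_list(struct("rn", "word"))), lambda s: s["word"])
        ).alias("keywords")
    )

input_df.join(keywords_df, "id").show(truncate=False)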