Character-level TF-IDF in PySpark?

Posted on 2025-01-24 20:41:11

I have a dataset that consists of 10M rows:

>>> df

    name            job                                     company
0   Amanda Arroyo   Herbalist                               Norton-Castillo
1   Victoria Brown  Outdoor activities/education manager    Bowman-Jensen
2   Amy Henry       Chemist, analytical                     Wilkerson, Guerrero and Mason

I want to calculate 3-gram character-level TF-IDF vectors for the column name, as I would easily do with sklearn:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
X = tfidf.fit_transform(df['name'])

The problem is that I can't see any reference to it in the Spark documentation or in the HashingTF API docs.

Is this achievable at all with PySpark?

烟沫凡尘 2025-01-31 20:41:12

Yes, it is achievable. The tools are available:

TFIDF Spark vs sklearn and ngrams.

Example of characters being tokenized (note that Tokenizer splits on whitespace, so the characters in the input are space-separated):

from pyspark.ml.feature import Tokenizer

# assumes an active SparkSession named `spark`
df = spark.createDataFrame([("a b c",)], ["text"])
tokenizer = Tokenizer(outputCol="words")
tokenizer.setInputCol("text")
# Tokenizer...
tokenizer.transform(df).head()
# Row(text='a b c', words=['a', 'b', 'c'])
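
To connect this to the 3-gram case in the question, here is a minimal sketch of one possible approach, not a definitive implementation. It assumes the names are in a Spark DataFrame with a string column called name. The helper names char_ngrams and char_tfidf, and the intermediate column names ngrams, tf and tfidf, are made up for illustration. The n-grams are produced by a plain Python UDF, then weighted with CountVectorizer and IDF from pyspark.ml.feature.

```python
from typing import List, Optional


def char_ngrams(text: Optional[str], n: int = 3) -> List[str]:
    """Return the overlapping character n-grams of `text`, lowercased."""
    if text is None:
        return []
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def char_tfidf(df, input_col: str = "name", n: int = 3):
    """Add 'ngrams', 'tf' and 'tfidf' columns to a Spark DataFrame.

    Hypothetical helper: the column names are illustrative assumptions.
    """
    # Imported here so char_ngrams can be tried without a Spark installation.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import CountVectorizer, IDF
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Wrap the pure-Python function as a UDF returning array<string>.
    ngrams_udf = F.udf(lambda s: char_ngrams(s, n), ArrayType(StringType()))
    with_ngrams = df.withColumn("ngrams", ngrams_udf(F.col(input_col)))

    # Term counts per row, then inverse-document-frequency weighting.
    pipeline = Pipeline(stages=[
        CountVectorizer(inputCol="ngrams", outputCol="tf"),
        IDF(inputCol="tf", outputCol="tfidf"),
    ])
    return pipeline.fit(with_ngrams).transform(with_ngrams)


print(char_ngrams("Amy Henry"))
# ['amy', 'my ', 'y h', ' he', 'hen', 'enr', 'nry']
```

With an active SparkSession and the data loaded as a Spark DataFrame (say sdf, a hypothetical name), char_tfidf(sdf).select("name", "tfidf") should give one sparse vector per row. The values will not match sklearn's exactly: Spark's IDF uses log((m + 1) / (df + 1)) and does not normalize the vectors, while TfidfVectorizer by default adds 1 to the smoothed IDF and applies L2 normalization. On 10M rows a Python UDF may be a bottleneck, and HashingTF could replace CountVectorizer if a fixed-size hashed feature space is acceptable.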
