How to group by and get the most frequent ngram?
My dataframe looks like this:
ID topics text
1 1 twitter is my favorite social media
2 1 favorite social media
3 2 rt twitter tomorrow
4 3 rt facebook today
5 3 rt twitter
6 4 vote for the best twitter
7 2 twitter tomorrow
8 4 best twitter
I want to group by topics and use CountVectorizer to compute the most frequent bigrams (I really prefer to use CountVectorizer because it can remove stop words in multiple languages and lets me set an n-gram range, e.g. 3- or 4-grams). After I get the most frequent bigram, I want to create a new column called "biagram" and assign the most frequent bigram per topic to that column.
I want my output to look like this.
ID topics text biagram
1 1 twitter is my favorite social media favorite social
2 1 favorite social media favorite social
3 2 rt twitter tomorrow twitter tomorrow
4 2 twitter tomorrow twitter tomorrow
5 3 rt twitter rt twitter
6 3 rt facebook today rt twitter
7 4 vote for the best twitter best twitter
8 4 best twitter best twitter
Please note that the column 'topics' does NOT need to be in order by topic. I ordered the rows for the sake of visualization when creating this post.
This code will be run on 6M rows of data, so it needs to be fast.
What is the best way to do it using pandas? I apologize if it seems too complicated.
Update

You can use sklearn:
Update 2
Old answer
You can use nltk:
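The nltk code block was also lost. A sketch of what an nltk-based version might look like, assuming `bigrams` and `FreqDist` (note that unlike the CountVectorizer route, this does no stop-word removal, so results on stop-word-heavy topics may differ):

```python
import pandas as pd
from nltk import bigrams, FreqDist

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "topics": [1, 1, 2, 3, 3, 4, 2, 4],
    "text": [
        "twitter is my favorite social media",
        "favorite social media",
        "rt twitter tomorrow",
        "rt facebook today",
        "rt twitter",
        "vote for the best twitter",
        "twitter tomorrow",
        "best twitter",
    ],
})

def top_bigram(texts):
    # Count bigram tuples across every document in the group;
    # FreqDist.max() returns the most frequent sample.
    fd = FreqDist(bg for t in texts for bg in bigrams(t.split()))
    return " ".join(fd.max())

df["biagram"] = df.groupby("topics")["text"].transform(top_bigram)
```

Stop-word filtering could be added by dropping unwanted tokens from `t.split()` before forming bigrams (e.g. with `nltk.corpus.stopwords`, which requires a one-time download).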