How to group by and get the most frequent ngram?

My dataframe looks like this:

ID topics   text
1     1        twitter is my favorite social media
2     1        favorite social media
3     2        rt twitter tomorrow
4     3        rt facebook today
5     3        rt twitter
6     4        vote for the best twitter
7     2        twitter tomorrow
8     4        best twitter
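
For reference, a minimal sketch that builds this example frame (column names and dtypes assumed from the table above):

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'topics': [1, 1, 2, 3, 3, 4, 2, 4],
    'text': ['twitter is my favorite social media',
             'favorite social media',
             'rt twitter tomorrow',
             'rt facebook today',
             'rt twitter',
             'vote for the best twitter',
             'twitter tomorrow',
             'best twitter'],
})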

I want to group by topics and use CountVectorizer to compute the most frequent bigrams (I prefer CountVectorizer because it can remove stop words in multiple languages and lets me set an ngram range, e.g. 3- and 4-grams). After I get the most frequent bigram, I want to create a new column called "bigram" and assign each topic's most frequent bigram to that column.

I want my output to look like this (IDs taken from the input frame above):

ID  topics  text                                  bigram
1   1       twitter is my favorite social media   favorite social
2   1       favorite social media                 favorite social
3   2       rt twitter tomorrow                   twitter tomorrow
7   2       twitter tomorrow                      twitter tomorrow
5   3       rt twitter                            rt twitter
4   3       rt facebook today                     rt twitter
6   4       vote for the best twitter             best twitter
8   4       best twitter                          best twitter

Please note that the 'topics' column does NOT need to be sorted by topic; I only ordered it for the sake of visualization when creating this post.

This code will be run on 6M rows of data, so it needs to be fast.

What is the best way to do it using pandas? I apologize if it seems too complicated.

Answer by 阳光①夏 (2025-02-18 17:02:09):

Update

You can use sklearn:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Count bigrams only (ngram_range=(2, 2)), dropping English stop words
vect = CountVectorizer(analyzer='word', ngram_range=(2, 2), stop_words='english')
data = vect.fit_transform(df['text'])

# Sum bigram counts per topic and take each group's most frequent bigram
bigram = (pd.DataFrame(data=data.toarray(),
                       index=df['topics'],
                       columns=vect.get_feature_names_out())
            .groupby('topics').sum().idxmax(axis=1))
df['bigram'] = df['topics'].map(bigram)
print(df)

# Output
   ID  topics                                 text            bigram
0   1       1  twitter is my favorite social media   favorite social
1   2       1                favorite social media   favorite social
2   3       2                  rt twitter tomorrow  twitter tomorrow
3   4       3                    rt facebook today    facebook today
4   5       3                           rt twitter    facebook today
5   6       4            vote for the best twitter      best twitter
6   7       2                     twitter tomorrow  twitter tomorrow
7   8       4                         best twitter      best twitter
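
One caveat for the 6M-row requirement in the question: data.toarray() densifies the whole document-term matrix and can exhaust memory. Below is a sketch (not benchmarked) that keeps the matrix sparse and aggregates per topic with a single sparse matmul, reusing df, vect and data from above and assuming scipy is installed:

import numpy as np
from scipy import sparse

# Indicator matrix G: G[t, r] = 1 iff row r belongs to topic t
codes, topics = pd.factorize(df['topics'])
G = sparse.csr_matrix((np.ones(len(codes)), (codes, np.arange(len(codes)))))

counts = G @ data                        # (n_topics, n_bigrams), still sparse
top = np.asarray(counts.argmax(axis=1)).ravel()
df['bigram'] = df['topics'].map(
    pd.Series(vect.get_feature_names_out()[top], index=topics))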

Update 2

How about if I want the 3 most frequent ngrams? What can I use instead of idxmax()?

# For each topic, keep the 3 bigrams with the largest summed counts
most_common3 = lambda x: x.sum().nlargest(3).index.to_frame(index=False).squeeze()
bigram = (pd.DataFrame(data=data.toarray(),
                       index=df['topics'],
                       columns=vect.get_feature_names_out())
            .groupby('topics').apply(most_common3)
            .rename(columns=lambda x: f"bigram{x+1}").reset_index())
df = df.merge(bigram, on='topics')
print(df)

# Output
   topics                                 text           bigram1       bigram2           bigram3
0       1  twitter is my favorite social media   favorite social  social media  twitter favorite
1       1                favorite social media   favorite social  social media  twitter favorite
2       2                  rt twitter tomorrow  twitter tomorrow    rt twitter      best twitter
3       2                     twitter tomorrow  twitter tomorrow    rt twitter      best twitter
4       3                    rt facebook today    facebook today   rt facebook        rt twitter
5       3                           rt twitter    facebook today   rt facebook        rt twitter
6       4            vote for the best twitter      best twitter     vote best    facebook today
7       4                         best twitter      best twitter     vote best    facebook today
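
Along the same lines, the top 3 per topic can be read off the sparse per-topic counts without densifying the full matrix. A sketch reusing counts, topics and vect from the sparse example above (note argsort may break ties differently than nlargest):

import numpy as np

# counts has only n_topics rows, so densifying it (unlike the full matrix) is cheap
dense = counts.toarray()
order = np.argsort(dense, axis=1)[:, ::-1][:, :3]   # top-3 column indices per topic
names = vect.get_feature_names_out()
top3 = pd.DataFrame(names[order], index=topics,
                    columns=['bigram1', 'bigram2', 'bigram3'])
df = df.merge(top3, left_on='topics', right_index=True)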

Old answer

You can use nltk:

import nltk

# Tuples (unlike lists) are hashable, so Series.mode can count them
to_bigram = lambda x: tuple(nltk.bigrams(x.split()))
# Per topic: take the most common bigram sequence, then its first bigram --
# this is why topic 2 yields (rt, twitter) rather than (twitter, tomorrow)
most_common = (df.set_index('topics')['text'].map(to_bigram)
                 .groupby(level=0).apply(lambda x: x.mode()[0][0]))

df['bigram'] = df['topics'].map(most_common)
print(df)

# Output
   ID  topics                                 text              bigram
0   1       1  twitter is my favorite social media  (favorite, social)
1   2       1                favorite social media  (favorite, social)
2   3       2                  rt twitter tomorrow       (rt, twitter)
3   4       3                    rt facebook today      (rt, facebook)
4   5       3                           rt twitter      (rt, facebook)
5   6       4            vote for the best twitter     (best, twitter)
6   7       2                     twitter tomorrow       (rt, twitter)
7   8       4                         best twitter     (best, twitter)
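
If the goal is instead the single most frequent bigram per topic (rather than the first bigram of the most common sequence, which is what the code above computes), a sketch that flattens the bigrams first, reusing df and to_bigram:

most_common = (df.set_index('topics')['text'].map(to_bigram)
                 .explode()
                 .groupby(level=0).agg(lambda x: x.mode()[0]))
df['bigram'] = df['topics'].map(most_common)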