Count the frequency of two-word combinations across all rows of a column

Posted on 2025-01-26 07:32:52


I want to count the frequency of two-word combinations (bigrams) across all the rows of a column.

I have a table with two columns: the first holds a sentence, and the second holds one bigram from the tokenization of that sentence.

Sentence	words
'beautiful day suffered through '	'beautiful day'
'beautiful day suffered through '	'day suffered'
'beautiful day suffered through '	'suffered through'
'cannot hold back tears '	'cannot hold'
'cannot hold back tears '	'hold back'
'cannot hold back tears '	'back tears'
'ash back tears beautiful day '	'ash back'
'ash back tears beautiful day '	'back tears'
'ash back tears beautiful day '	'tears beautiful'
'ash back tears beautiful day '	'beautiful day'

My desired output is a column giving, for each bigram, how often it occurs across all the sentences in the whole df['Sentence'] column.
Something like this:

Sentence	Words	Total
'beautiful day suffered through '	'beautiful day'	2
'beautiful day suffered through '	'day suffered'	1
'beautiful day suffered through '	'suffered through'	1
'cannot hold back tears '	'cannot hold'	1
'cannot hold back tears '	'hold back'	1
'cannot hold back tears '	'back tears'	2
'ash back tears beautiful day '	'ash back'	1
'ash back tears beautiful day '	'back tears'	2
'ash back tears beautiful day '	'tears beautiful'	1
'ash back tears beautiful day '	'beautiful day'	2

and so on.

The code I tried repeats the same frequency for every row of a sentence.

df.Sentence.str.count('|'.join(df.words.tolist()))

So it is not what I am looking for, and it also takes a very long time because my original df is much larger.

Is there an alternative, or a function in NLTK or another library, that does this?


薆情海 2025-02-02 07:32:52

I suggest:

Start by removing the quotes, and the whitespace at the beginning and end, of both Sentence and words:

data = data.apply(lambda x: x.str.replace("'", ""))
data["Sentence"] = data["Sentence"].str.strip()
data["words"] = data["words"].str.strip()

Then cast Sentence and words to strings:

data = data.astype({"Sentence":str, "words": str})
print(data)

#Output
                          Sentence            words
0   beautiful day suffered through     beautiful day
1   beautiful day suffered through      day suffered
2   beautiful day suffered through  suffered through
3           cannot hold back tears       cannot hold
4           cannot hold back tears         hold back
5           cannot hold back tears        back tears
6     ash back tears beautiful day          ash back
7     ash back tears beautiful day        back tears
8     ash back tears beautiful day   tears beautiful
9     ash back tears beautiful day     beautiful day

Count the occurrences of the given words in the sentence on the same row and store them in a column, e.g. words_occur:

def words_in_sent(row):
    return row["Sentence"].count(row["words"])
data["words_occur"] = data.apply(words_in_sent, axis=1)
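
One caveat with this step: Python's str.count matches raw substrings, so a bigram such as "day suffered" would also be counted inside "today suffered". If that matters for your data, a minimal sketch of a whole-word count could look like the following. The helper name count_whole_bigram is hypothetical, and the regex word-boundary approach is a swapped-in technique, not part of the steps above.

```python
import re

def count_whole_bigram(sentence: str, bigram: str) -> int:
    """Count occurrences of `bigram` in `sentence`, matching whole words only."""
    # re.escape guards against regex metacharacters in the bigram;
    # \b anchors the match at word boundaries on both sides.
    pattern = r"\b" + re.escape(bigram) + r"\b"
    return len(re.findall(pattern, sentence))

sample = "today suffered a beautiful day suffered"
substring_hits = sample.count("day suffered")                   # raw substring count
whole_word_hits = count_whole_bigram(sample, "day suffered")    # whole-word count
print(substring_hits, whole_word_hits)
```

It can be dropped into the row-wise apply in place of row["Sentence"].count(row["words"]).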

Finally, group by words and sum their occurrences:

data["total"] = data["words_occur"].groupby(data["words"]).transform("sum")
print(data)

Result

                         Sentence             words  words_occur  total
0  beautiful day suffered through     beautiful day            1      2
1  beautiful day suffered through      day suffered            1      1
2  beautiful day suffered through  suffered through            1      1
3          cannot hold back tears       cannot hold            1      1
4          cannot hold back tears         hold back            1      1
5          cannot hold back tears        back tears            1      2
6    ash back tears beautiful day          ash back            1      1
7    ash back tears beautiful day        back tears            1      2
8    ash back tears beautiful day   tears beautiful            1      1
9    ash back tears beautiful day     beautiful day            1      2
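
For anyone who wants to try this end to end, the steps above can be put together as a self-contained sketch. The sample table is rebuilt here from the three sentences in the question (already stripped of quotes and trailing spaces), which is an assumption about how your data looks; the list comprehension replaces the row-wise apply but computes the same count.

```python
import pandas as pd

# Sample sentences from the question, assumed already cleaned.
sentences = [
    "beautiful day suffered through",
    "cannot hold back tears",
    "ash back tears beautiful day",
]

# Rebuild the question's table: one row per (sentence, bigram) pair.
rows = []
for sentence in sentences:
    tokens = sentence.split()
    for first, second in zip(tokens, tokens[1:]):
        rows.append({"Sentence": sentence, "words": f"{first} {second}"})
data = pd.DataFrame(rows)

# Occurrences of each bigram within the sentence on the same row.
data["words_occur"] = [s.count(w) for s, w in zip(data["Sentence"], data["words"])]

# Sum those occurrences over every row that shares the same bigram.
data["total"] = data.groupby("words")["words_occur"].transform("sum")
print(data)
```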


嘴硬脾气大 2025-02-02 07:32:52

As I understand it, you want a count of each bigram across the unique sentences. That information is already in the words column, and value_counts() delivers it.

df.merge(df['words'].value_counts(), how='left', left_on='words', right_index=True, suffixes=(None,'_total')) 

                           Sentence             words  words_total
0  beautiful day suffered through       beautiful day            2
1  beautiful day suffered through        day suffered            1
2  beautiful day suffered through    suffered through            1
3          cannot hold back tears         cannot hold            1
4          cannot hold back tears           hold back            1
5          cannot hold back tears          back tears            2
6    ash back tears beautiful day            ash back            1
7    ash back tears beautiful day          back tears            2
8    ash back tears beautiful day     tears beautiful            1
9    ash back tears beautiful day       beautiful day            2
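
Note that the column name produced by the merge can depend on the pandas version: from pandas 2.0 on, value_counts() names its result count instead of reusing the column name, so the merged column may come out as count rather than words_total. A version-independent sketch, using map instead of merge (a swapped-in technique, not the merge shown above), could look like this. The sample frame is rebuilt from the question's data and assumed to be already cleaned.

```python
import pandas as pd

# Sample table from the question, assumed already cleaned.
df = pd.DataFrame({
    "Sentence": ["beautiful day suffered through"] * 3
                + ["cannot hold back tears"] * 3
                + ["ash back tears beautiful day"] * 4,
    "words": [
        "beautiful day", "day suffered", "suffered through",
        "cannot hold", "hold back", "back tears",
        "ash back", "back tears", "tears beautiful", "beautiful day",
    ],
})

# value_counts() returns a Series indexed by bigram; map() looks up each
# row's bigram in that index, so the result column name is set explicitly.
df["words_total"] = df["words"].map(df["words"].value_counts())
print(df)
```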
