How to find collocations in text, python

Published on 2024-10-01 09:56:44


How do you find collocations in text?
A collocation is a sequence of words that occurs together unusually often.
NLTK provides a bigrams function that returns word pairs (an iterator in NLTK 3, hence the list call):

>>> from nltk import bigrams
>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>

What's left is to find bigrams that occur more often than the frequencies of the individual words would predict. Any ideas how to put this into code?
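One common way to do exactly that is pointwise mutual information (PMI): score each bigram by how much more often it occurs than its two words' individual frequencies would predict. A minimal stdlib-only sketch (toy data; the function name is mine):

```python
import math
from collections import Counter

def score_bigrams(words):
    """Rank bigrams by pointwise mutual information (PMI):
    log2 of observed pair count over the count expected from
    the individual word frequencies."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n = len(words)
    scores = {
        (a, b): math.log2(count * n / (unigrams[a] * unigrams[b]))
        for (a, b), count in bigrams.items()
    }
    # highest PMI first
    return sorted(scores, key=scores.get, reverse=True)

words = "the quick fox and the lazy dog and the quick fox".split()
print(score_bigrams(words)[0])  # → ('lazy', 'dog')
```

Note that "lazy dog" wins even though "the quick" occurs more often in absolute terms: both of its words are individually frequent, so their co-occurrence is less surprising. On real corpora you would also filter out low-frequency pairs, since PMI overrates rare events.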


Comments (5)

无悔心 2024-10-08 09:56:44


Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started:

>>> import nltk
>>> def tokenize(sentences):
...     for sent in nltk.sent_tokenize(sentences.lower()):
...         for word in nltk.word_tokenize(sent):
...             yield word
... 

>>> text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
>>> text
<Text: mary had a little lamb ....>

There are none in this small segment, but here goes:

>>> text.collocations(num=20)
Building collocations list
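The BigramCollocationFinder mentioned above can also be driven directly from a plain token list, without building a Text object; a minimal sketch (toy data, PMI chosen as the ranking statistic):

```python
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

words = "the quick fox and the lazy dog and the quick fox".split()

# count bigrams and unigrams over the token stream
finder = BigramCollocationFinder.from_words(words)

# rank bigrams by pointwise mutual information, best first
print(finder.nbest(BigramAssocMeasures.pmi, 3))
```

On real text you would typically call `finder.apply_freq_filter(2)` (or higher) first, since PMI gives inflated scores to pairs that occur only once.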
幸福丶如此 2024-10-08 09:56:44


Here is some code that takes a list of lowercase words and returns a list of all bigrams with their respective counts, starting with the highest count. Don't use this code for large lists.

words = ["more", "is", "said", "than", "done", "is", "said"]
words_iter = iter(words)
next(words_iter, None)  # advance one step so zip pairs each word with its successor
count = {}
for bigram in zip(words, words_iter):
    count[bigram] = count.get(bigram, 0) + 1
print(sorted(((c, b) for b, c in count.items()), reverse=True))

(words_iter is introduced to avoid copying the whole list of words, as you would do in zip(words, words[1:]).)

南…巷孤猫 2024-10-08 09:56:44
from collections import Counter

words = ['more', 'is', 'said', 'than', 'done']
nextword = iter(words)
next(nextword)  # advance one step so zip pairs each word with its successor
freq = Counter(zip(words, nextword))
print(freq)
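Counter can also rank the pairs directly via most_common, which saves the manual sort; a small follow-on sketch with repeated words so the counts differ:

```python
from collections import Counter

words = ['more', 'is', 'said', 'than', 'done', 'more', 'is']
nextword = iter(words)
next(nextword)  # shift by one so zip yields consecutive pairs
freq = Counter(zip(words, nextword))

# highest-count bigrams first
print(freq.most_common(2))
```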
无戏配角 2024-10-08 09:56:44


A collocation is a sequence of tokens that is better treated as a single token when parsing; e.g., "red herring" has a meaning that can't be derived from its components. Deriving a useful set of collocations from a corpus involves ranking the n-grams by some statistic (n-gram frequency, mutual information, log-likelihood, etc.) followed by judicious manual editing.

Points that you appear to be ignoring:

(1) the corpus must be rather large ... attempting to get collocations from one sentence as you appear to suggest is pointless.

(2) n can be greater than 2 ... e.g. analysing texts written about 20th century Chinese history will throw up "significant" bigrams like "Mao Tse" and "Tse Tung".

What are you actually trying to achieve? What code have you written so far?
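The zip trick from the other answers extends naturally to n > 2; a stdlib sketch counting trigrams on toy data echoing the example above (NLTK also has a TrigramCollocationFinder for the statistic-based ranking):

```python
from collections import Counter

words = "mao tse tung led china and mao tse tung wrote essays".split()

# three staggered views of the list give consecutive triples
trigrams = Counter(zip(words, words[1:], words[2:]))

print(trigrams.most_common(1))  # → [(('mao', 'tse', 'tung'), 2)]
```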

因为看清所以看轻 2024-10-08 09:56:44


Agree with Tim McNamara on using nltk and the problems with unicode. However, I like the Text class a lot - there is a hack you can use to get the collocations as a list; I discovered it while looking at the source code. Apparently, whenever you invoke the collocations method it saves the result as a class variable!

    import nltk
    def tokenize(sentences):
        for sent in nltk.sent_tokenize(sentences.lower()):
            for word in nltk.word_tokenize(sent):                 
                yield word


    text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
    text.collocations(num=20)
    collocations = [" ".join(el) for el in list(text._collocations)]

Enjoy!
