How to find collocations in text, Python
How do you find collocations in text?
A collocation is a sequence of words that occurs together unusually often.
Python (via NLTK) has a bigrams function that returns word pairs.
>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>
What's left is to find bigrams that occur more often than the frequencies of the individual words would predict. Any ideas how to put this into code?
Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started (there are no collocations in this small segment, but here goes):
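The demonstration code itself was not preserved in this copy of the answer; below is a minimal sketch of how BigramCollocationFinder is typically used, assuming NLTK is installed (the sample token list is made up for illustration, and a real corpus should be far larger):

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Any list of tokens will do; this toy sample just repeats one sentence.
words = ("more is said than done more is said than done "
         "more is said than done").split()

# Build a finder over adjacent word pairs in the token stream.
finder = BigramCollocationFinder.from_words(words)

# Rank candidate bigrams by pointwise mutual information (PMI):
# pairs that co-occur more often than their word frequencies predict.
bigram_measures = BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 5))
```

On a real corpus you would usually also call finder.apply_freq_filter(n) first to drop bigrams seen fewer than n times, since PMI is noisy for rare pairs.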
Here is some code that takes a list of lowercase words and returns a list of all bigrams with their respective counts, starting with the highest count. Don't use this code for large lists. (words_iter is introduced to avoid copying the whole list of words as you would do in izip(words, words[1:]).)
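The code block itself did not survive in this copy of the answer; here is a plausible reconstruction of the approach described, updated for Python 3 (zip replaces izip), with words_iter standing in for the words[1:] copy:

```python
from collections import Counter

def bigram_counts(words):
    """Return (bigram, count) pairs sorted by count, highest first."""
    # Pair each word with its successor without copying the list,
    # as words[1:] would: advance a second iterator by one position.
    words_iter = iter(words)
    next(words_iter, None)
    bigrams = zip(words, words_iter)
    return Counter(bigrams).most_common()

print(bigram_counts(['more', 'is', 'said', 'than', 'done', 'more', 'is']))
```

Counter.most_common does the sorting, so the whole thing is a few lines; ranking by raw bigram count alone, though, favours frequent words rather than true collocations, which is where the frequency-based scores discussed below come in.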
A collocation is a sequence of tokens that is better treated as a single token when parsing, e.g. "red herring" has a meaning that can't be derived from its components. Deriving a useful set of collocations from a corpus involves ranking the n-grams by some statistic (n-gram frequency, mutual information, log-likelihood, etc.) followed by judicious manual editing.
Points that you appear to be ignoring:
(1) the corpus must be rather large ... attempting to get collocations from one sentence as you appear to suggest is pointless.
(2) n can be greater than 2 ... e.g. analysing texts written about 20th century Chinese history will throw up "significant" bigrams like "Mao Tse" and "Tse Tung".
What are you actually trying to achieve? What code have you written so far?
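To make the "ranking by some statistic" point above concrete, here is a small hand-rolled sketch of one such score, pointwise mutual information, computed from raw counts (the counts below are invented purely for illustration):

```python
from math import log2

def pmi(count_xy, count_x, count_y, n_words):
    """Pointwise mutual information of a bigram:
    log2( P(x,y) / (P(x) * P(y)) ).  High values mean the pair
    co-occurs far more often than its parts' frequencies predict."""
    p_xy = count_xy / n_words
    p_x = count_x / n_words
    p_y = count_y / n_words
    return log2(p_xy / (p_x * p_y))

# Made-up counts from a hypothetical 1,000,000-word corpus:
# "red herring" appears 50 times; "red" 1000 times; "herring" 100 times.
print(round(pmi(50, 1000, 100, 1_000_000), 2))  # → 8.97
```

A score near zero means the words co-occur about as often as chance predicts; this is also why the corpus-size point above matters, since PMI computed from a handful of sentences is dominated by noise.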
Agree with Tim McNamara on using nltk and the problems with unicode. However, I like the Text class a lot - there is a hack you can use to get the collocations as a list; I discovered it by looking at the source code. Apparently, whenever you invoke the collocations method it saves the result as a class variable!
Enjoy!