What is a simple way to generate keywords from a text?
I suppose I could take a text and remove high-frequency English words from it. By keywords, I mean the words (tags) that best characterize the content of the text. It doesn't have to be perfect; a good approximation is enough for my needs.
Has anyone done anything like that? Do you know of a Perl or Python library that does this?
Lingua::EN::Tagger is exactly what I asked for, but I need a library that also works for French text.
Comments (8)
The name for the "high frequency English words" is stop words, and many lists are available. I'm not aware of any Python or Perl libraries, but you could encode your stop word list in a binary tree or hash (or you could use Python's frozenset); then, as you read each word from the input text, check whether it is in your stop list and filter it out.
Note that after you remove the stop words, you'll need to do some stemming to normalize the resulting text (remove plurals, -ings, -eds), then remove all the duplicate "keywords".
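The steps described in this answer can be sketched roughly as follows. This is only a minimal illustration: the stop lists are tiny made-up samples (real lists are far longer), `naive_stem` is a hypothetical stand-in for a real stemmer and only handles a few English suffixes, and the function names are invented for the example. Keeping one stop list per language is one way to cover French text as well.

```python
import re

# Tiny illustrative stop lists (assumption: real lists are much longer).
STOP_WORDS = {
    "en": frozenset({"the", "a", "an", "and", "of", "in", "on", "is", "are", "to", "it"}),
    "fr": frozenset({"le", "la", "les", "un", "une", "et", "de", "des", "est", "dans", "sur"}),
}


def naive_stem(word):
    """Crude English-only suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_keywords(text, lang="en"):
    """Drop stop words, stem (English only), and remove duplicates, keeping order."""
    stop = STOP_WORDS[lang]
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters, Unicode-aware
    seen = set()
    keywords = []
    for word in words:
        if word in stop:
            continue
        stem = naive_stem(word) if lang == "en" else word
        if stem not in seen:
            seen.add(stem)
            keywords.append(stem)
    return keywords


print(extract_keywords("The cats are chasing the cat and jumped on a table"))
print(extract_keywords("Le chat est dans la maison", lang="fr"))
```

The crude stemmer produces stems such as "chas" for "chasing", which is fine for grouping word forms but not for display; a proper stemming library would be the natural replacement.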
You could try using the Perl module Lingua::EN::Tagger for a quick and easy solution.
A more complicated module, Lingua::EN::Semtags::Engine, uses Lingua::EN::Tagger with a WordNet database to get more structured output. Both are pretty easy to use; just check out the documentation on CPAN or use perldoc after you install the module.
To find the most frequently used words in a text, do something like this:
Example output looks like this:
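The snippet and output for this answer did not survive on this page. As a minimal sketch of the same idea, assuming Python's `collections.Counter` (not necessarily what the answer used) and a hypothetical helper name, the most frequent words can be found like this:

```python
import re
from collections import Counter


def most_common_words(text, n=5):
    """Return the n most frequent words in text as (word, count) pairs."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters, lowercased
    return Counter(words).most_common(n)


print(most_common_words("the cat and the dog and the bird", n=2))
```

Without stop-word filtering, the top entries will mostly be function words such as "the" and "and".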
In Perl there's Lingua::EN::Keywords.
I think the most accurate way that still maintains a semblance of simplicity would be to count the word frequencies in your source, then weight them according to their frequencies in common English (or whatever other language) usage.
Words that appear less frequently in common use, like "coffeehouse", are more likely to be keywords than words that appear more often, like "dog". Still, if your source mentions "dog" 500 times and "coffeehouse" twice, "dog" is more likely to be a keyword even though it's a common word.
Deciding on the weighting scheme would be the difficult part.
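One possible weighting, shown here only as a sketch, multiplies each word's count in the source by the logarithm of its inverse background frequency. The background frequencies below are made-up values for illustration (a real table would come from a large corpus of the target language), and `DEFAULT_FREQ` and the function name are assumptions of this example, not part of the answer.

```python
import math
from collections import Counter

# Hypothetical relative frequencies in general English usage (made-up values).
BACKGROUND_FREQ = {"the": 5e-2, "dog": 1e-4, "coffeehouse": 1e-7}
DEFAULT_FREQ = 1e-6  # assumed frequency for words missing from the table


def weighted_scores(words):
    """Score each word as count * log(1 / background frequency)."""
    counts = Counter(words)
    return {
        word: count * math.log(1.0 / BACKGROUND_FREQ.get(word, DEFAULT_FREQ))
        for word, count in counts.items()
    }


# A rare word outranks a common one at equal counts...
equal = weighted_scores(["dog", "coffeehouse", "the"])
print(sorted(equal, key=equal.get, reverse=True))

# ...but a common word mentioned often enough still wins.
skewed = weighted_scores(["dog"] * 500 + ["coffeehouse"] * 2)
print(max(skewed, key=skewed.get))
```

With these numbers the scheme reproduces both cases from the answer: "coffeehouse" beats "dog" when each appears once, while 500 mentions of "dog" outweigh two of "coffeehouse".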
TF-IDF (Term Frequency - Inverse Document Frequency) is designed for this.
Basically, it asks: which words are frequent in this document compared to all documents?
It gives a lower score to words that appear in all documents, and a higher score to words that appear frequently in a given document.
You can see a worksheet of the calculations here:
https://docs.google.com/spreadsheet/ccc?key=0AreO9JhY28gcdFMtUFJrc0dRdkpiUWlhNHVGS1h5Y2c&usp=sharing
(switch to TFIDF tab at bottom)
Here is a Python library:
https://github.com/hrs/python-tf-idf
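For a sense of the calculation itself, here is a minimal from-scratch sketch. It does not use the linked library's API; the function name is invented for the example, and it assumes one common variant of the formula (relative term frequency times the natural log of total documents over documents containing the word).

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {word: score} dict per document."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # term frequency * inverse document frequency
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


docs = [
    ["the", "dog", "barks", "dog"],
    ["the", "cat", "sleeps"],
    ["the", "bird", "sings"],
]
first = tf_idf(docs)[0]
print(sorted(first, key=first.get, reverse=True))
```

A word such as "the" that occurs in every document gets a score of zero, because log(3/3) is 0, while "dog" scores highest in the first document.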
The simplest way to do what you want is this...
I don't know of any standard module that does this, but it wouldn't be hard to replace the limit on three-letter words with a lookup into a set of common English words.
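The code this answer refers to is not preserved on this page. The sketch below is a guess at its shape based on the description: it counts words and, by default, skips words of three letters or fewer (an assumed reading of "the limit on three letter words"). Passing a `common_words` set swaps the length limit for the lookup the answer suggests. The function and parameter names are hypothetical.

```python
import re
from collections import Counter


def keywords(text, common_words=None, min_length=4, n=10):
    """Return up to n of the most frequent words that pass the filter."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    if common_words is None:
        # Default filter: drop short words, which are mostly function words.
        kept = [w for w in words if len(w) >= min_length]
    else:
        # Alternative filter: drop anything found in the common-word set.
        kept = [w for w in words if w not in common_words]
    return [word for word, _ in Counter(kept).most_common(n)]


sample = "The cat sat on the sofa, the sofa was soft"
print(keywords(sample))
print(keywords(sample, common_words={"the", "on", "was"}, n=2))
```

The length limit is language-neutral but blunt: it discards short content words such as "cat", which the set lookup keeps.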
One-liner solution (words longer than two characters that occur more than two times):
EDIT: To sort words with the same frequency alphabetically, this enhanced version can be used:
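Neither one-liner is preserved on this page, and the language they were written in is not stated. As a rough Python equivalent of what the answer describes (not the original code, and with a hypothetical function name), the following keeps words longer than two characters that occur more than two times, sorted by descending count and then alphabetically:

```python
import re
from collections import Counter


def frequent_words(text):
    """Words longer than 2 chars occurring more than 2 times, sorted by (-count, word)."""
    counts = Counter(
        w for w in re.findall(r"[^\W\d_]+", text.lower()) if len(w) > 2
    )
    frequent = [(word, count) for word, count in counts.items() if count > 2]
    # Negative count sorts most frequent first; the word itself breaks ties alphabetically.
    return sorted(frequent, key=lambda item: (-item[1], item[0]))


text = "red red red blue blue blue blue an an an green apple apple apple"
print(frequent_words(text))
```

Here "an" is dropped for its length and "green" for its count, and the tie between "apple" and "red" is resolved alphabetically.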