Algorithms or libraries for textual analysis, specifically: dominant words, phrases across text, and collections of text
I'm working on a project where I need to analyze a page of text, and collections of pages of text, to determine dominant words. I'd like to know if there is a library (preferably C# or Java) that will handle the heavy lifting for me. If not, is there an algorithm, or several, that would achieve my goals below?
What I want to do is similar to the word clouds built from a URL or RSS feed that you find on the web, except I don't want the visualization. They are used all the time for analyzing presidential candidates' speeches to see what the themes or most-used words are.
The complication is that I need to do this on thousands of short documents, and then on collections or categories of these documents.
My initial plan was to parse the document, then filter out common words (of, the, he, she, etc.), then count the number of times the remaining words show up in the text (and in the overall collection/category).
The problem is that in the future I would like to handle stemming, plural forms, etc. I would also like to see if there is a way to identify important phrases (instead of a count of a word, the count of a phrase of 2-3 words together).
Any guidance on a strategy, libraries, or algorithms that would help is appreciated.
One option for what you're doing is term frequency-inverse document frequency, or tf-idf. The strongest terms will have the highest weighting under this calculation. Check it out here: http://en.wikipedia.org/wiki/Tf-idf
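As a rough illustration of the tf-idf weighting, here is a minimal plain-Java sketch. Everything in it is an assumption made for the example rather than part of any library: the `TfIdf` class name, the simple regex tokenizer, and the unsmoothed variant of the formula (tf as a share of the document's tokens, idf as the natural log of N over the number of documents containing the term). Real toolkits often use smoothed or normalized variants.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

class TfIdf {
    // Hypothetical tokenizer: lowercase, then split on anything that is not a letter, digit or apostrophe.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Term frequency: share of the document's tokens that equal the term.
    static double tf(List<String> doc, String term) {
        if (doc.isEmpty()) return 0.0;
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // Inverse document frequency: log(N / number of documents containing the term).
    static double idf(List<List<String>> corpus, String term) {
        long containing = corpus.stream().filter(d -> d.contains(term)).count();
        if (containing == 0) return 0.0;
        return Math.log((double) corpus.size() / containing);
    }

    static double tfIdf(List<String> doc, List<List<String>> corpus, String term) {
        return tf(doc, term) * idf(corpus, term);
    }

    // Distinct terms of one document, highest tf-idf first; ties stay in alphabetical order.
    static List<String> topTerms(List<String> doc, List<List<String>> corpus, int k) {
        List<String> terms = new ArrayList<>(new TreeSet<>(doc));
        terms.sort(Comparator.comparingDouble((String t) -> tfIdf(doc, corpus, t)).reversed());
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }
}
```

Note that a word occurring in every document (such as "the") gets an idf of log(1) = 0, so it drops to the bottom of the ranking without any explicit stop-word list.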
Another option is to use something like a naive Bayes classifier with words as features, and find the strongest features in the text to determine the class of the document. This would work similarly with a maximum entropy classifier.
As far as tools to do this, the best tool to start with would be NLTK, a Python library with extensive documentation and tutorials: http://nltk.sourceforge.net/
For Java, try OpenNLP: http://opennlp.sourceforge.net/
For the phrase stuff, consider the second option I offered up by using bigrams and trigrams as features, or even as terms in tf-idf.
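To make the bigram/trigram idea concrete, here is a small sketch that turns a token list into n-gram "terms" and counts them. The `NGrams` class and its method names are made up for this example; the resulting strings could be counted directly or fed into a tf-idf calculation just like single words.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NGrams {
    // Slide a window of n tokens over the list and join each window with a space.
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    // Count how often each term (single word or n-gram) occurs.
    static Map<String, Integer> count(List<String> terms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : terms) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }
}
```

For example, `NGrams.count(NGrams.ngrams(tokens, 2))` gives bigram counts and `NGrams.ngrams(tokens, 3)` gives the trigrams. In practice you would probably want to avoid building n-grams across sentence boundaries, which this sketch does not attempt.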
Good luck!
To add to Robert Elwell's answer:
None of this stuff is clear cut, nor does any of it have "correct answers". See also the "nlp" and "natural-language" SO tags.
Good luck! This is a non-trivial project.
Alrighty. So you've got a document containing text and a collection of documents (a corpus). There are a number of ways to do this.
I would suggest using the Lucene engine (Java) to index your documents. Lucene supports a data structure (Index) that maintains a number of documents in it. A document itself is a data structure that can contain "fields" - say, author, title, text, etc. You can choose which fields are indexed and which ones are not.
Adding documents to an index is trivial. Lucene is also built for speed, and can scale superbly.
Next, you want to figure out the terms and the frequencies. Since Lucene has already calculated these for you during the indexing process, you can either use the docFreq function and build your own term frequency function, or use the IndexReader class's getTermFreqVectors function to get the terms (and their frequencies).
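To show what those two numbers mean, here is a toy in-memory index in plain Java. This is not Lucene's API: `TinyIndex` and its methods are hypothetical names that only mimic the idea of a per-document term frequency vector and a per-term document frequency. With Lucene itself you would read these from the index, and the exact calls depend on the version you use.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TinyIndex {
    // One term -> count map per document, in the order documents were added.
    private final List<Map<String, Integer>> termVectors = new ArrayList<>();
    // For each term, the number of documents that contain it at least once.
    private final Map<String, Integer> documentFrequencies = new HashMap<>();

    // Adds an already tokenized document and returns its id.
    int addDocument(List<String> tokens) {
        Map<String, Integer> vector = new TreeMap<>();
        for (String t : tokens) {
            vector.merge(t, 1, Integer::sum);
        }
        // Each distinct term in this document raises its document frequency by one.
        for (String t : vector.keySet()) {
            documentFrequencies.merge(t, 1, Integer::sum);
        }
        termVectors.add(vector);
        return termVectors.size() - 1;
    }

    int docFreq(String term) {
        return documentFrequencies.getOrDefault(term, 0);
    }

    Map<String, Integer> termFreqVector(int docId) {
        return Collections.unmodifiableMap(termVectors.get(docId));
    }

    int numDocs() {
        return termVectors.size();
    }
}
```

A category-level count can be built the same way by adding up the term vectors of the documents in that category.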
Now it's up to you how to sort it and what criteria you want to use to filter the words you want. To figure out relationships, you can use a Java API to the WordNet open source library. To stem words, use Lucene's PorterStemFilter class. The phrase importance part is trickier, but once you've gotten this far you can search for tips on how to integrate n-gram searching into Lucene (hint).
Good luck!
You could use Windows Indexing Service, which comes with the Windows Platform SDK. Or, just read the following introduction to get an overview of NLP.
http://msdn.microsoft.com/en-us/library/ms693179(VS.85).aspx
http://i.msdn.microsoft.com/ms693179.wbr-index-create(en-us,VS.85).gif
http://i.msdn.microsoft.com/ms693179.wbr-query-process(en-us,VS.85).gif
Check out the MapReduce model to get the word count, and then derive the frequency as described in tf-idf.
Hadoop is an Apache MapReduce framework that can be used for the heavy lifting of counting words across many documents.
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
You will not find a single framework that does everything you want. You have to choose the right combination of concepts and frameworks to get what you want.
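The shape of the MapReduce word count can be imitated in plain Java to see what the two phases do. This sketch deliberately does not use Hadoop's classes; `WordCountSketch` and its method names are invented for illustration, and it runs in a single process, so it shows the data flow only, not the distribution across machines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // Map phase: emit a (word, 1) pair for every token in one document.
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : document.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!w.isEmpty()) pairs.add(Map.entry(w, 1));
        }
        return pairs;
    }

    // Reduce phase: sum the values of all pairs that share the same key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }

    // Run the map phase over every document, then reduce all emitted pairs together.
    static Map<String, Integer> wordCount(List<String> documents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String d : documents) {
            pairs.addAll(map(d));
        }
        return reduce(pairs);
    }
}
```

In a real cluster the map calls run in parallel on different machines and the framework groups the pairs by key before the reduce step; the counts that come out are the raw material for the tf-idf weights.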
This part of your problem is called collocation extraction. (At least if you take 'important phrases' to be phrases that appear significantly more often than by chance.) I gave an answer over at another SO question about that specific subproblem.
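One common way to score "more often than by chance" is pointwise mutual information (PMI), which compares how often a word pair occurs with how often it would occur if the two words were independent. The sketch below is only one possible measure, not necessarily the one described in the linked answer; the `Collocations` class, the `minCount` cutoff, and the assumption that tokens contain no spaces are all choices made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class Collocations {
    // Returns PMI = log( P(x y) / (P(x) * P(y)) ) for every adjacent word pair seen at least minCount times.
    static Map<String, Double> pmi(List<String> tokens, int minCount) {
        Map<String, Integer> unigrams = new HashMap<>();
        Map<String, Integer> bigrams = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            unigrams.merge(tokens.get(i), 1, Integer::sum);
            if (i + 1 < tokens.size()) {
                bigrams.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
            }
        }
        double totalWords = tokens.size();
        double totalPairs = Math.max(1, tokens.size() - 1);
        Map<String, Double> scores = new TreeMap<>();
        for (Map.Entry<String, Integer> e : bigrams.entrySet()) {
            if (e.getValue() < minCount) continue; // rare pairs get inflated PMI, so drop them
            String[] words = e.getKey().split(" ");
            double pPair = e.getValue() / totalPairs;
            double pFirst = unigrams.get(words[0]) / totalWords;
            double pSecond = unigrams.get(words[1]) / totalWords;
            scores.put(e.getKey(), Math.log(pPair / (pFirst * pSecond)));
        }
        return scores;
    }
}
```

A positive score means the pair shows up together more often than independence would predict. The frequency cutoff matters because PMI tends to favor pairs of rare words that happened to co-occur once or twice.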
It seems that what you are looking for is called bag-of-words document clustering/classification.
You will find guidance with this search.