Word frequency algorithm for natural language processing
Without getting a degree in information retrieval, I'd like to know if there exists any algorithms for counting the frequency that words occur in a given body of text. The goal is to get a "general feel" of what people are saying over a set of textual comments. Along the lines of Wordle.
What I'd like:
- ignore articles, pronouns, etc ('a', 'an', 'the', 'him', 'them' etc)
- preserve proper nouns
- ignore hyphenation, except for the soft kind
Reaching for the stars, these would be peachy:
- handling stemming & plurals (e.g. like, likes, liked, liking match the same result)
- grouping of adjectives (adverbs, etc) with their subjects ("great service" as opposed to "great", "service")
I've attempted some basic stuff using WordNet, but I'm just tweaking things blindly and hoping it works for my specific data. Something more generic would be great.
8 Answers
The algorithm? You just described it. A program that does it out of the box, with a big button saying "Do it"... I don't know of one.
But let me be constructive: I recommend the book Programming Collective Intelligence. Chapters 3 and 4 contain very pragmatic examples (really, no complex theory, just examples).
You can use the WordNet dictionary to get basic information about a question keyword, such as its part of speech, and to extract its synonyms; you can do the same for your document to build an index for it.
Then you can easily match the keywords against the index file, rank the documents, and summarize them.
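As an illustration, a minimal NLTK lookup along these lines might look like the sketch below (the keyword is just an example, and it assumes the WordNet data has been downloaded via nltk.download):

    # Sketch: look up part of speech and synonyms for a keyword with WordNet via NLTK.
    # Assumes nltk.download("wordnet") has been run once.
    from nltk.corpus import wordnet as wn

    keyword = "service"   # example keyword
    for synset in wn.synsets(keyword):
        # Each synset carries a part of speech ('n', 'v', 'a', ...) and synonym lemmas.
        print(synset.pos(), synset.lemma_names())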
Everything you have listed is handled well by spaCy.
If the list of topics is predetermined and not huge, you can even go further: build a classification model that will predict the topic.
Let's say you have 10 subjects. You collect sample sentences or texts and load them into another product, Prodigy. Using its great interface you quickly assign subjects to the samples. Finally, using the categorized samples, you train a spaCy model to predict the subject of a text or sentence.
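For example, a rough sketch of that kind of spaCy pipeline (assuming the en_core_web_sm model is installed; the input sentence is just a placeholder):

    # Sketch: count lemmas while skipping stop words, keeping proper nouns, and
    # pulling out "great service"-style noun chunks. The model name is an assumption.
    import spacy
    from collections import Counter

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Great service and friendly staff. The staff liked them. Great service again!")

    counts = Counter()
    for token in doc:
        if token.is_stop or token.is_punct or token.is_space:
            continue                              # drop articles, pronouns, punctuation
        if token.pos_ == "PROPN":
            counts[token.text] += 1               # preserve proper nouns as written
        else:
            counts[token.lemma_.lower()] += 1     # lemma folds plurals/inflections together

    chunk_counts = Counter(chunk.text.lower() for chunk in doc.noun_chunks)

    print(counts.most_common(10))
    print(chunk_counts.most_common(10))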
You'll need not one, but several nice algorithms, along the lines of the following.
I'm sorry, I know you said you wanted to KISS, but unfortunately your demands aren't that easy to meet. Nevertheless, tools exist for all of this, and you should be able to just tie them together without having to perform any task yourself, if you don't want to. If you do want to perform a task yourself, I suggest you look at stemming; it's the easiest of them all.
If you go with Java, combine Lucene with the OpenNLP toolkit. You will get very good results, as Lucene already has a stemmer built in and a lot of tutorials. The OpenNLP toolkit, on the other hand, is poorly documented, but you won't need much from it. You might also be interested in NLTK, written in Python.
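If you do want to try the stemming part yourself in Python, a minimal NLTK sketch would be something like:

    # Sketch: NLTK's Porter stemmer collapses inflected forms onto a common stem.
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["like", "likes", "liked", "liking"]:
        print(word, "->", stemmer.stem(word))   # all four reduce to "like"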
I would suggest you drop your last requirement, as it involves shallow parsing and will definitely not improve your results.
Ah, by the way: the exact term for that document-term-frequency thing you were looking for is tf-idf. It's pretty much the best way to weigh document frequency for terms. To do it properly, you won't get around using multidimensional vector matrices.
... Yes, I know. After taking a seminar on IR, my respect for Google was even greater. After doing some work in IR myself, though, that respect fell just as quickly.
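As a rough illustration of the tf-idf weighting mentioned above, scikit-learn's TfidfVectorizer builds that document-term matrix for you (the sample comments are made up; any tf-idf implementation would do):

    # Sketch: rank terms across a set of comments by their summed tf-idf weight.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    comments = [
        "great service and friendly staff",
        "the service was slow but the food was great",
        "friendly staff, great food",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")   # drops articles, pronouns, etc.
    matrix = vectorizer.fit_transform(comments)          # rows = comments, columns = terms

    totals = np.asarray(matrix.sum(axis=0)).ravel()      # overall weight per term
    ranked = sorted(zip(vectorizer.get_feature_names_out(), totals),
                    key=lambda pair: pair[1], reverse=True)
    for term, weight in ranked:
        print(term, round(float(weight), 3))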
Here is an example of how you might do that in Python; the concepts are similar in any language.
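A minimal sketch of the script the next paragraph walks through might look like this (urllib.request stands in for the urllib2 mentioned below so it runs on Python 3, and the Project Gutenberg URL is an assumption):

    # Sketch: download a text and count how often each word occurs.
    import string
    import urllib.request

    # Grab a copy of Ambrose Bierce's "Devil's Dictionary" (URL assumed).
    url = "https://www.gutenberg.org/files/972/972.txt"
    text = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")

    # List of all the words in the text, lowercased and stripped of punctuation.
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]

    # Hash table mapping each unique word to its number of occurrences.
    freq = {}
    for word in words:
        if word in freq:
            freq[word] += 1    # seen before: add one to its count
        else:
            freq[word] = 1     # first occurrence

    # Print the 20 most frequent words.
    for word, count in sorted(freq.items(), key=lambda kv: kv[1], reverse=True)[:20]:
        print(word, count)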
The first line just gets libraries that help with parts of the problem, as in the second line, where urllib2 downloads a copy of Ambrose Bierce's "Devil's Dictionary". The next lines make a list of all the words in the text, without punctuation. Then you create a hash table, which in this case is like a list of unique words associated with a number. The for loop goes over each word in the Bierce book: if there is already a record of that word in the table, each new occurrence adds one to the value associated with that word; if the word hasn't appeared yet, it gets added to the table with a value of 1 (meaning one occurrence). For the cases you are talking about, you would want to pay much more attention to detail, for example using capitalization to help identify proper nouns only in the middle of sentences, and so on; this is very rough but it expresses the concept.
To get into the stemming and pluralization stuff, experiment, then look into third-party work. I have enjoyed parts of NLTK, an academic open-source project, also written in Python.
I wrote a full program to do just this a while back. I can upload a demo later when I get home.
Here is the code (asp.net/c#): http://naspinski.net/post/Findingcounting-Keywords-out-of-a-Text-Document.aspx
The first part of your question doesn't sound so bad. All you basically need to do is read each word from the file (or stream) and place it into a prefix tree, and each time you happen upon a word that already exists, you increment the value associated with it. Of course, you would also have an ignore list of everything you'd like left out of your calculations.
If you use a prefix tree, you ensure that finding any word is O(N), where N is the maximum length of a word in your data set. The advantage of a prefix tree in this situation is that if you want to look for plurals and stems, you can check in O(M+1) whether that's even possible for the word, where M is the length of the word without stem or plurality (is that a word? hehe). Once you've built your prefix tree, I would reanalyze it for the stems and such and condense it down so that the root word is what holds the results.
Upon searching you could have some simple rules in place to have the match return positive in case of the root or stem or what have you.
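A bare-bones sketch of that prefix-tree counter might look like the following (class names and the ignore list are just illustrative):

    # Sketch: a prefix tree (trie) that counts word occurrences, skipping an ignore list.
    class TrieNode:
        def __init__(self):
            self.children = {}   # letter -> child node
            self.count = 0       # occurrences of the word ending at this node

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def add(self, word):
            node = self.root
            for ch in word:      # walking the word is O(N) in its length
                node = node.children.setdefault(ch, TrieNode())
            node.count += 1

        def count(self, word):
            node = self.root
            for ch in word:
                if ch not in node.children:
                    return 0
                node = node.children[ch]
            return node.count

    IGNORE = {"a", "an", "the", "him", "them"}

    trie = Trie()
    for word in "the staff liked the service and the service liked them".split():
        if word not in IGNORE:
            trie.add(word)

    print(trie.count("service"))   # -> 2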
The second part seems extremely challenging. My naive inclination would be to hold separate results for adjective-subject groupings. Use the same principles as above but just keep it separate.
Another option for the semantic analysis could be modeling each sentence as a tree of subject, verb, etc relationships (Sentence has a subject and verb, subject has a noun and adjective, etc). Once you've broken all of your text up in this way it seems like it might be fairly easy to run through and get a quick count of the different appropriate pairings that occurred.
Just some ramblings, I'm sure there are better ideas, but I love thinking about this stuff.
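As one concrete take on that idea, spaCy's dependency parse already gives you those relationships, so counting adjective-noun pairings can be a few lines (model name assumed, sample sentence made up):

    # Sketch: count adjective-noun pairings ("great service") via the amod dependency.
    import spacy
    from collections import Counter

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("They gave great service and great food, but the slow service annoyed us.")

    pairs = Counter()
    for token in doc:
        if token.dep_ == "amod" and token.head.pos_ == "NOUN":
            pairs[(token.lemma_, token.head.lemma_)] += 1   # e.g. ("great", "service")

    print(pairs.most_common())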
Welcome to the world of NLP ^_^
All you need is a little basic knowledge and some tools.
There are already tools that will tell you if a word in a sentence is a noun, adjective or verb. They are called part-of-speech taggers. Typically, they take plaintext English as input, and output the word, its base form, and the part-of-speech. Here is the output of a popular UNIX part-of-speech tagger on the first sentence of your post:
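An excerpt of the kind of output this produces is shown below (one token per line: word, part-of-speech tag, base form; exact columns and tags vary from tagger to tagger):

    Without      IN    without
    getting      VBG   get
    a            DT    a
    degree       NN    degree
    in           IN    in
    information  NN    information
    retrieval    NN    retrieval
    ,            ,     ,
    ...
    there        EX    there
    exists       VBZ   exist
    any          DT    any
    algorithms   NNS   algorithm
    ...
    the          DT    the
    frequency    NN    frequency
    ...
    text         NN    text
    .            SENT  .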
As you can see, it identified "algorithms" as being the plural form (NNS) of "algorithm" and "exists" as being a conjugation (VBZ) of "exist." It also identified "a" and "the" as "determiners (DT)" -- another word for article. As you can see, the POS tagger also tokenized the punctuation.
To do everything but the last point on your list, you just need to run the text through a POS tagger, filter out the categories that don't interest you (determiners, pronouns, etc.) and count the frequencies of the base forms of the words.
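With NLTK, for instance, that whole recipe might be sketched like this (the ignore set and the nltk.download calls are assumptions about your setup):

    # Sketch: POS-tag the text, drop uninteresting categories, and count base forms.
    # Assumes nltk.download("punkt"), nltk.download("averaged_perceptron_tagger")
    # and nltk.download("wordnet") have been run once.
    import nltk
    from collections import Counter
    from nltk.stem import WordNetLemmatizer

    text = ("Without getting a degree in information retrieval, I'd like to know "
            "if there exists any algorithms for counting the frequency that words "
            "occur in a given body of text.")

    tagged = nltk.pos_tag(nltk.word_tokenize(text))       # (word, tag) pairs
    SKIP = {"DT", "PRP", "PRP$", "IN", "CC", "TO", "EX"}  # determiners, pronouns, etc.
    lemmatizer = WordNetLemmatizer()

    counts = Counter()
    for word, tag in tagged:
        if tag in SKIP or not word.isalpha():
            continue
        pos = "v" if tag.startswith("VB") else "n"        # crude mapping for the lemmatizer
        counts[lemmatizer.lemmatize(word.lower(), pos)] += 1

    print(counts.most_common(10))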
Here are some popular POS taggers:
TreeTagger (binary only: Linux, Solaris, OS-X)
GENIA Tagger (C++: compile it yourself)
Stanford POS Tagger (Java)
To do the last thing on your list, you need more than just word-level information. An easy way to start is by counting sequences of words rather than just words themselves. These are called n-grams. A good place to start is UNIX for Poets. If you are willing to invest in a book on NLP, I would recommend Foundations of Statistical Natural Language Processing.
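Counting bigrams, for example, is just a matter of pairing each word with its successor; a tiny sketch:

    # Sketch: count word bigrams ("great service") instead of single words.
    from collections import Counter

    words = "great service and great food and more great service".split()
    bigrams = Counter(zip(words, words[1:]))   # each word paired with the next one

    print(bigrams.most_common(3))   # ('great', 'service') appears twice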