Document analysis and tagging
Let's say I have a bunch of essays (thousands) that I want to tag, categorize, etc. Ideally, I'd like to train something by manually categorizing/tagging a few hundred, and then let the thing loose.
What resources (books, blogs, languages) would you recommend for undertaking such a task? Part of me thinks this would be a good fit for a Bayesian classifier or even Latent Semantic Analysis, but I'm not really familiar with either beyond what I've found from a few Ruby gems (e.g. http://rubyforge.org/projects/bishop/).
Can something like this be solved by a Bayesian classifier? Should I be looking more at semantic analysis/natural language processing? Or should I just be looking at keyword density and mapping from there?
Any suggestions are appreciated (I don't mind picking up a few books, if that's what's needed)!
2 Answers
Wow, that's a pretty huge topic you are venturing into :)
There are definitely a lot of books and articles you can read about it, but I'll try to provide a short introduction. I'm not a big expert, but I've worked on some of this stuff.
First you need to decide whether you want to classify essays into predefined topics/categories (a classification problem) or you want the algorithm to decide on the groups on its own (a clustering problem). From your description it appears you are interested in classification.
Now, when doing classification, you first need to create enough training data: a number of essays that are already separated into different groups, for example 5 physics essays, 5 chemistry essays, 5 programming essays, and so on. Generally you want as much training data as possible, but how much is enough depends on the specific algorithm. You also need validation data, which is basically like the training data but kept completely separate; it is used to judge the quality (or, in math-speak, the performance) of your algorithm.
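For illustration, a minimal sketch of such a split in Ruby (the essays, labels, 80/20 ratio, and the stub classifier are all placeholders, not a prescribed setup):

```ruby
# Hand-tagged essays as [text, category] pairs (toy placeholders).
labeled = [
  ["Quarks bind together to form nuclei",  :physics],
  ["Benzene rings are aromatic compounds", :chemistry],
  ["Closures capture their lexical scope", :programming],
  # ...a few hundred more hand-tagged essays...
]

# Shuffle with a fixed seed so the split is reproducible, then hold
# out 20% as validation data (the 80/20 ratio is an arbitrary choice).
shuffled   = labeled.shuffle(random: Random.new(42))
cutoff     = (shuffled.size * 0.8).floor
training   = shuffled[0...cutoff]
validation = shuffled[cutoff..]

# Judge the trained classifier on the validation set only. The lambda
# below is a dummy stand-in for whatever classifier you train.
classify = ->(text) { :physics }
correct  = validation.count { |text, category| classify.(text) == category }
puts "accuracy: #{100.0 * correct / validation.size}%"
```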
Finally, the algorithms themselves. The two I am familiar with are Bayes-based and TF-IDF-based. For Bayes, I am currently developing something similar for myself in Ruby, and I've documented my experiences on my blog. If you are interested, just read this - http://arubyguy.com/2011/03/03/bayes-classification-update/ - and if you have any follow-up questions I will try to answer.
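The blog post has the details; purely as a sketch of the general idea (not the post's implementation; class name and tokenizer are made up), a multinomial naive Bayes classifier in Ruby can be as small as this:

```ruby
# Minimal multinomial naive Bayes: score(category) =
# log P(category) + sum over words of log P(word | category).
# Working in log space avoids float underflow; add-one (Laplace)
# smoothing keeps unseen words from zeroing a category out.
class NaiveBayes
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # category => word => count
    @doc_counts  = Hash.new(0)                            # category => essay count
    @vocabulary  = {}
  end

  def train(text, category)
    @doc_counts[category] += 1
    tokenize(text).each do |word|
      @word_counts[category][word] += 1
      @vocabulary[word] = true
    end
  end

  def classify(text)
    words      = tokenize(text)
    total_docs = @doc_counts.values.sum.to_f
    @doc_counts.keys.max_by do |category|
      total_words = @word_counts[category].values.sum
      prior = Math.log(@doc_counts[category] / total_docs)
      words.reduce(prior) do |score, word|
        score + Math.log((@word_counts[category][word] + 1.0) /
                         (total_words + @vocabulary.size))
      end
    end
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

nb = NaiveBayes.new
nb.train("quarks bind together to form nuclei", :physics)
nb.train("benzene rings are aromatic compounds", :chemistry)
nb.classify("the nuclei of heavy atoms") # => :physics
```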
TF-IDF is short for Term Frequency - Inverse Document Frequency. Basically, the idea is, for any given document, to find the documents in the training set that are most similar to it, and then figure out its category from those. For example, if document D is similar to T1 (physics), T2 (physics) and T3 (chemistry), you guess that D is most likely about physics with a little chemistry.
The way it's done is to give the most importance to rare words and little or none to common words. For instance, 'nuclei' is a rare physics word, but 'work' is a very common, uninteresting word; that is exactly what the inverse document frequency weighting captures. If you can work with Java, there is a very good library, Lucene, which provides most of this out of the box. Look for its similar-documents API (MoreLikeThis) and look into how it is implemented. Or just google 'TF-IDF' if you want to implement your own.
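As a rough sketch of that idea in plain Ruby (vanilla TF-IDF weights, cosine similarity, and a majority vote over the k nearest training documents; Lucene's actual scoring is considerably more refined, and the class name here is invented):

```ruby
# Bare-bones TF-IDF nearest-neighbor classifier: weight each term by
# tf * log(N / df), so rare terms like 'nuclei' dominate and ubiquitous
# terms like 'work' count for almost nothing, then label a new document
# by majority vote over its k most cosine-similar training documents.
class TfIdfClassifier
  def initialize(k = 3)
    @k  = k
    @df = Hash.new(0) # word => number of training docs containing it
  end

  def train(labeled_texts) # [[text, category], ...]
    tokenized = labeled_texts.map { |text, category| [tokenize(text), category] }
    tokenized.each { |tokens, _| tokens.uniq.each { |word| @df[word] += 1 } }
    @n    = tokenized.size.to_f
    @docs = tokenized.map { |tokens, category| [vectorize(tokens), category] }
  end

  def classify(text)
    query     = vectorize(tokenize(text))
    neighbors = @docs.max_by(@k) { |vector, _| cosine(query, vector) }
    neighbors.map(&:last).tally.max_by { |_, votes| votes }.first
  end

  private

  def vectorize(tokens)
    tokens.tally.to_h do |word, tf|
      # +1 smoothing so words never seen in training don't divide by zero
      [word, tf * Math.log((@n + 1) / (@df[word] + 1))]
    end
  end

  def cosine(a, b)
    dot = a.sum { |word, weight| weight * b.fetch(word, 0.0) }
    return 0.0 if dot.zero?
    dot / (Math.sqrt(a.values.sum { |w| w * w }) *
           Math.sqrt(b.values.sum { |w| w * w }))
  end

  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end
```

Feed `train` a few hundred tagged essays and `classify` will tag a new essay by its nearest tagged neighbors; this is the "similar documents, then infer the category" pattern described above.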
I've done something similar in the past (though it was for short news articles) using a vector-clustering algorithm. I don't remember exactly which one right now; it was one Google used in its infancy.
Using their paper, I was able to get a prototype running in PHP in one or two days; I then ported it to Java for speed.
http://en.wikipedia.org/wiki/Vector_space_model
http://www.la2600.org/talks/files/20040102/Vector_Space_Search_Engine_Theory.pdf