Classifying documents using tags
I have a large number of documents (mainly PDF and DOC files) that I want to classify so that I can search them by tag. The tags could either be my own (assigned to each document by hand) or be extracted from the text.
I've just seen a related post (Classify data using Apache Mahout), but perhaps there is a simpler approach.
Comments (2)
Mahout might be overkill for your problem, but you can get a fairly quick, easy solution with OpenNLP.
http://opennlp.sourceforge.net/api/index.html
Specifically, look at the opennlp.tools.doccat package. Essentially, you have to go through and manually tag a small(ish) set of items for each category you want. If the categories are really distinct, you can get away with a small sample size.
You can use the static DocumentCategorizerME.train() function to train on a collection of documents, where each document needs a category tag and the block of text to train on. Then you can initialize a DocumentCategorizerME with the trained model and begin classifying the rest of your documents.
Once you have done this, you can (I think) write the model to a file so you never have to train it again.
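To make the label-train-classify workflow concrete, here is a minimal sketch in plain Java. It does not use OpenNLP's API: `TinyDocCategorizer` is a hypothetical stand-in that swaps in a small multinomial naive Bayes classifier with add-one smoothing, and it assumes the text has already been extracted from the PDF or DOC files. The category names and sample sentences are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a document categorizer; NOT part of OpenNLP. */
class TinyDocCategorizer {
    // category -> (word -> occurrences in that category's labeled documents)
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    // category -> total number of tokens seen for that category
    private final Map<String, Integer> totalWords = new HashMap<>();
    // category -> number of labeled documents
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Adds one hand-labeled document: a category tag plus its text block. */
    void train(String category, String text) {
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(token);
        }
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
    }

    /** Returns the category with the highest naive Bayes log-probability. */
    String classify(String text) {
        if (totalDocs == 0) {
            throw new IllegalStateException("train() must be called first");
        }
        List<String> tokens = tokenize(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            // Prior: share of labeled documents that belong to this category.
            double score = Math.log((double) docCounts.get(category) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(category);
            int total = totalWords.getOrDefault(category, 0);
            for (String token : tokens) {
                if (!vocabulary.contains(token)) {
                    continue; // words never seen in training carry no information
                }
                int c = counts.getOrDefault(token, 0);
                // Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
                score += Math.log((c + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyDocCategorizer categorizer = new TinyDocCategorizer();
        categorizer.train("invoice", "Invoice number 123 total amount due payment");
        categorizer.train("contract", "This agreement is made between the parties");
        System.out.println(categorizer.classify("Payment for the amount due on this invoice"));
    }
}
```

With a real library the shape is the same: a few labeled samples per category go in, a model comes out, and the model is then applied to the untagged documents. The in-memory maps here play the role of the model that you would otherwise save to a file.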
This post on extracting keywords and classifying webpages is related and may be helpful. In your case, it sounds like you can use your own tags in place of the keyword-extraction step (although you may want to combine the two). Weka is easy to use; I would definitely recommend giving it a look.
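As a rough illustration of combining hand-assigned tags with extracted keywords, the sketch below builds a small in-memory tag index in plain Java. It is not Weka and not the method from the linked post: `TagIndex`, its stop-word list, and the "most frequent words" heuristic are all assumptions made for this example, and the text is assumed to be already extracted from the files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical tag index: manual tags plus simple frequency-based keywords. */
class TagIndex {
    // Tiny illustrative stop-word list; a real one would be much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "and", "for", "this", "that", "with", "are", "was", "from"));

    // tag -> ids of the documents carrying that tag
    private final Map<String, Set<String>> docsByTag = new TreeMap<>();

    /** Returns up to k most frequent words (ties broken alphabetically). */
    static List<String> extractKeywords(String text, int k) {
        Map<String, Integer> freq = new HashMap<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            // Skip very short tokens and stop words.
            if (t.length() > 2 && !STOP_WORDS.contains(t)) {
                freq.merge(t, 1, Integer::sum);
            }
        }
        List<String> words = new ArrayList<>(freq.keySet());
        words.sort((a, b) -> {
            int byCount = Integer.compare(freq.get(b), freq.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(words.subList(0, Math.min(k, words.size())));
    }

    /** Indexes a document under its manual tags and its top extracted keywords. */
    void add(String docId, Collection<String> manualTags, String text, int autoTagCount) {
        Set<String> tags = new HashSet<>();
        for (String tag : manualTags) {
            tags.add(tag.toLowerCase(Locale.ROOT));
        }
        tags.addAll(extractKeywords(text, autoTagCount));
        for (String tag : tags) {
            docsByTag.computeIfAbsent(tag, x -> new TreeSet<>()).add(docId);
        }
    }

    /** Case-insensitive lookup of all documents carrying the given tag. */
    Set<String> search(String tag) {
        return docsByTag.getOrDefault(tag.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        TagIndex index = new TagIndex();
        index.add("a.pdf", Arrays.asList("Finance"),
                "Invoice for consulting services. Invoice total due in 30 days.", 2);
        System.out.println(index.search("invoice"));
    }
}
```

Raw word frequency is a crude keyword signal; weighting such as TF-IDF, or a trained classifier, would usually give better tags on a large collection.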