Text classification: extracting tags from text
I have a Lucene index with a lot of text data. Each item has a description, and I want to extract the most common words from each description and generate tags to classify the item. Is there a Lucene.NET library for this, or any other library for text classification?
Comments (2)
No. Lucene.NET provides search, indexing, text normalization, and "find more like this" functionality, but not text classification.
What to suggest depends on your requirements, so a more detailed description would help.
Generally, though, the easiest way is to use an external service. Such services typically expose a REST API, which is easy to call from C#.
There are also good Java libraries such as Mahout. As I remember, Mahout can also be run as a service, so integrating with it should not be a problem.
Among external services, I used Open Calais for a similar "auto tagging" task in C#. It is free for up to 50,000 transactions per day, which was enough for me. uClassify also has good pricing; for example, the "Indie" license is $99 per year.
But maybe external services and Mahout are not right for you. Then take a look at the DBpedia project and RDF.
Finally, you can at least use an implementation of the Naive Bayes algorithm. It is easy, and everything stays under your control.
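To give an idea of how small a Naive Bayes implementation can be, here is a minimal sketch in Python (chosen only for brevity; the same logic ports directly to C#). It is a multinomial Naive Bayes with add-one smoothing. The class name `NaiveBayesTagger`, the `tokenize` helper, and the training data are all made up for illustration and do not come from any library.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTagger:
    """Hypothetical multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # tag -> word -> count
        self.doc_counts = Counter()              # tag -> number of training texts
        self.vocabulary = set()                  # every word seen in training

    def train(self, text, tag):
        """Record the words of one labelled description."""
        words = tokenize(text)
        self.word_counts[tag].update(words)
        self.doc_counts[tag] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the tag with the highest log-probability for the text."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_tag, best_score = None, float("-inf")
        for tag, n_docs in self.doc_counts.items():
            counts = self.word_counts[tag]
            total_words = sum(counts.values())
            # Log prior of the tag plus the smoothed log likelihood of each word.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag


# Illustrative training data (invented for this example).
tagger = NaiveBayesTagger()
tagger.train("fast laptop with long battery life", "electronics")
tagger.train("wireless headphones with noise cancelling", "electronics")
tagger.train("cotton shirt with short sleeves", "clothing")
tagger.train("warm wool winter jacket", "clothing")
print(tagger.classify("laptop battery charger"))
```

In practice the descriptions pulled from the Lucene index would serve as training texts, and each one needs a known tag, so this approach requires a hand-labelled sample to start from.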
This is a very hard problem, but if you don't want to spend time on it, you can take all words whose frequency in the whole document is between 5% and 10%. Or simply take the 5 most common words.
Doing tag extraction well is very, very hard. It is so hard that whole companies live off web services exposing such an API.
You can also remove stopwords (using a fixed stopword list obtained from the internet).
And you can find common N-grams (for example, pairs of words), which you can use to find multi-word tags.
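The quick approach described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: it uses the simpler "5 most common words" variant instead of the 5% to 10% frequency band, the stopword list is a tiny placeholder for a real one, and the names `extract_tags`, `STOPWORDS`, and `min_pair_count` are invented for this example.

```python
import re
from collections import Counter

# Tiny placeholder stopword list (an assumption); a real list would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "with", "in", "on", "to",
             "is", "it", "this", "has"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_tags(text, top_words=5, top_pairs=3, min_pair_count=2):
    """Return multi-word tags (frequent word pairs) followed by frequent single words."""
    tokens = tokenize(text)

    # Single-word candidates: every token that is not a stopword.
    word_counts = Counter(t for t in tokens if t not in STOPWORDS)

    # Two-word candidates: adjacent tokens where neither one is a stopword.
    pair_counts = Counter(
        f"{first} {second}"
        for first, second in zip(tokens, tokens[1:])
        if first not in STOPWORDS and second not in STOPWORDS
    )

    singles = [word for word, _ in word_counts.most_common(top_words)]
    # Keep only pairs that repeat, so one-off word combinations do not become tags.
    pairs = [pair for pair, count in pair_counts.most_common(top_pairs)
             if count >= min_pair_count]
    return pairs + singles


description = (
    "The mountain bike has a light frame. This mountain bike is built for "
    "trails, and the mountain bike frame is strong."
)
print(extract_tags(description))
```

Counting words over the whole index instead of a single description would additionally let you drop words that are common everywhere and therefore say little about any one item.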