Any recommendations for a small, lightweight bag-of-words search engine?
I have a set of 'documents' that are each basically a small bag of arbitrary words.
Given a new document, I need to get a list of 'similar' documents, along with some weight for how similar each one is. Documents are likely to be small: a couple of paragraphs at most.
- Stemming would be great but not highly required.
- Word expansion with word nets not required.
- Open source or freeware preferred, as this is a prototype, not a full-blown project.
- Unix/Linux platform preferred.
I'd be using it as a subcomponent: I expect only to feed it documents, each with an ID, and later search for documents 'similar' to one I currently have.
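For reference, the interface described above (add documents by ID, then ask for weighted 'similar' documents) can be sketched in plain Python. This is only a minimal illustration, assuming TF-IDF weighting and cosine similarity as the similarity measure; the class name `BagOfWordsIndex` and its methods are hypothetical, there is no stemming, and it is not how any of the engines mentioned in this thread work internally.

```python
import math
import re
from collections import Counter


class BagOfWordsIndex:
    """Tiny in-memory index (hypothetical): add docs by ID, query for similar ones."""

    def __init__(self):
        self.docs = {}              # doc_id -> Counter of term frequencies
        self.doc_freq = Counter()   # term -> number of documents containing it

    @staticmethod
    def tokenize(text):
        # Lowercase and keep runs of letters/digits; a stemmer could be applied here.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        # Re-adding an existing ID replaces the old document.
        if doc_id in self.docs:
            self.doc_freq.subtract(self.docs[doc_id].keys())
        counts = Counter(self.tokenize(text))
        self.docs[doc_id] = counts
        self.doc_freq.update(counts.keys())

    def _weights(self, counts):
        # Smoothed TF-IDF so that terms unseen in the index still get a finite weight.
        n = len(self.docs)
        return {
            term: tf * (math.log((1 + n) / (1 + self.doc_freq[term])) + 1.0)
            for term, tf in counts.items()
        }

    def similar(self, text, top=5):
        """Return up to `top` (doc_id, cosine_similarity) pairs, best first."""
        query = self._weights(Counter(self.tokenize(text)))
        q_norm = math.sqrt(sum(w * w for w in query.values()))
        results = []
        for doc_id, counts in self.docs.items():
            doc = self._weights(counts)
            dot = sum(w * doc[t] for t, w in query.items() if t in doc)
            if dot == 0:
                continue  # no shared terms, so not similar at all
            d_norm = math.sqrt(sum(w * w for w in doc.values()))
            results.append((doc_id, dot / (q_norm * d_norm)))
        results.sort(key=lambda pair: pair[1], reverse=True)
        return results[:top]


if __name__ == "__main__":
    index = BagOfWordsIndex()
    index.add("a", "the quick brown fox")
    index.add("b", "lazy dogs sleep all day")
    index.add("c", "a quick brown dog")
    for doc_id, score in index.similar("quick brown fox jumps"):
        print(doc_id, round(score, 3))
```

The scan over every stored document is fine for a prototype with small documents; a real engine would use an inverted index to avoid touching documents that share no terms with the query.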
Comments (4)
Whoosh is a pure Python (no C, no external database) indexer / search engine. Check out the documentation for more information. It does support stemming.
I tried it out on an XML dump of a MediaWiki instance and it seemed to work pretty well!
Solr or Sphinx. They aren't exactly lightweight, but I wouldn't recommend anything smaller: if the project turns out to be successful and needs to grow, switching search engines might be painful.
I think Lucene is an option. It should allow you to build a custom bag-of-words search engine.
I wonder about MongoDB: http://www.mongodb.org/display/DOCS/Home
It seems like full-text search may be what I'm after, and having additional fields to search on may be handy.