How to get frequently occurring phrases with Lucene
I would like to extract some frequently occurring phrases with Lucene. I am getting some information from TXT files, and I am losing a lot of context by not having phrase information; e.g., "information retrieval" is indexed as two separate words.
Is there a way to get phrases like this? I cannot find anything useful on the Internet; all advice, links, hints, and especially examples are appreciated!
EDIT: I store my documents just by title and content:
Document doc = new Document();
// "name": stored, and kept as a single un-analyzed token
doc.add(new Field("name", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
// "text": tokenized from a Reader, with term vectors storing positions and offsets
doc.add(new Field("text", fReader, Field.TermVector.WITH_POSITIONS_OFFSETS));
because for what I am doing, the content of the file is what matters most. Titles are too often not descriptive at all (e.g., I have many academic PDF papers whose titles are codes or numbers).
I desperately need to index the top occurring phrases from the text contents; only now do I see how inadequate this simple "bag of words" approach is.
3 Answers
Julia, it seems what you are looking for is n-grams, specifically bigrams (also called collocations).
Here's a chapter about finding collocations (PDF) from Manning and Schütze's Foundations of Statistical Natural Language Processing.
In order to do this with Lucene, I suggest using Solr with the ShingleFilterFactory.
Please see this discussion for details.
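If you are using plain Lucene rather than Solr, the same effect comes from wrapping your token stream in a ShingleFilter from the contrib analyzers. A minimal sketch, assuming Lucene 3.x; the composition of filters below is just one illustrative choice:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Emits word bigrams ("shingles") in addition to single words, so
// "information retrieval" is indexed as one term as well.
public class ShingleAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_36, reader);
        stream = new LowerCaseFilter(Version.LUCENE_36, stream);
        return new ShingleFilter(stream, 2); // maximum shingle size: 2 words
    }
}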
Is it possible for you to post any code that you have written?
Basically, a lot depends on the way you create your fields and store documents in Lucene.
Let's consider a case where I have two fields:
ID and Comments. In my ID field I allow values like 'finding nemo', i.e., strings with spaces, whereas Comments is a free-flowing text field, i.e., I allow anything and everything my keyboard allows and Lucene can understand.
Now, in a real-life scenario it does not make sense to split my ID 'finding nemo' into two different searchable strings, whereas I do want to index everything in Comments.
So what I will do is create a document (org.apache.lucene.document.Document) object to take care of this, something like the sketch below. Essentially, I create two fields: the ID is indexed with Field.Index.NOT_ANALYZED so it stays a single token, and the Comments are indexed with Field.Index.ANALYZED so the text is tokenized.
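A rough sketch of that setup, assuming the Lucene 3.x Field flags (commentsText stands in for whatever free text you have):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
// The ID stays one token, so 'finding nemo' remains a single searchable string
doc.add(new Field("id", "finding nemo", Field.Store.YES, Field.Index.NOT_ANALYZED));
// The comments are tokenized, so every word of the free text is searchable
doc.add(new Field("comments", commentsText, Field.Store.YES, Field.Index.ANALYZED));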
This is how you customize Lucene around the default tokenizer and analyzer; otherwise, you can write your own tokenizer and analyzer.
Link(s)
http://darksleep.com/lucene/
Hope this will help you... :)
Well, the problem of losing phrase context can be solved by using a PhraseQuery.
By default, an index contains positional information for terms, as long as you did not create pure Boolean fields by indexing with the omitTermFreqAndPositions option.
PhraseQuery uses this information to locate documents where terms are within a certain distance of one another.
For example, suppose a field contained the phrase “the quick brown fox jumped over the lazy dog”. Without knowing the exact phrase, you can still find this document by searching for documents with fields having quick and fox near each other. Sure, a plain TermQuery would do the trick to locate this document knowing either of those words, but in this case we only want documents that have phrases where the words are either exactly side by side (quick fox) or have one word in between (quick [irrelevant] fox).
The maximum allowable positional distance between terms to be considered a match is called slop.
Distance is the number of positional moves of terms to reconstruct the phrase in order.
Check out Lucene's JavaDoc for PhraseQuery.
The example below sketches how to work with the query object (an illustration assuming the Lucene 3.x API, searching the question's "text" field):
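import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

// Match "quick" and "fox" in order, with at most one word in between
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("text", "quick"));
phrase.add(new Term("text", "fox"));
phrase.setSlop(1); // 0 = exactly adjacent; 1 also allows "quick [x] fox"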
You can also try to combine various query types with the help of the BooleanQuery class.
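Continuing the sketch above (again assuming Lucene 3.x), a BooleanQuery could, for example, require the phrase and merely prefer an extra term:

import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

BooleanQuery combined = new BooleanQuery();
// Must contain the phrase; documents also mentioning "lucene" score higher
combined.add(phrase, BooleanClause.Occur.MUST);
combined.add(new TermQuery(new Term("text", "lucene")), BooleanClause.Occur.SHOULD);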
And regarding the frequency of phrases, I suppose Lucene's scoring considers the frequency of the terms occurring in the documents.
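Since you indexed your "text" field with term vectors (TermVector.WITH_POSITIONS_OFFSETS), you can also read the raw per-document frequencies back yourself. A sketch assuming the Lucene 3.x API, where reader is an open IndexReader and docId a document number (both hypothetical here):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

// Fetch the stored term vector for one document's "text" field
TermFreqVector vector = reader.getTermFreqVector(docId, "text");
String[] terms = vector.getTerms();         // distinct terms (or shingles)
int[] freqs = vector.getTermFrequencies();  // how often each one occurs
for (int i = 0; i < terms.length; i++) {
    System.out.println(terms[i] + " : " + freqs[i]);
}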