How to get frequently occurring phrases with Lucene
I would like to extract some frequently occurring phrases with Lucene. I am getting some information from TXT files, and I am losing a lot of context by not having phrase information; e.g., "information retrieval" is indexed as two separate words.
Is there a way to get phrases like this? I cannot find anything useful on the Internet; all advice, links, hints, and especially examples are appreciated!
EDIT: I store my documents just by title and content:
Document doc = new Document();
// "name": stored, and kept as a single un-analyzed token
doc.add(new Field("name", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
// "text": tokenized from a Reader, with term vectors storing positions and offsets
doc.add(new Field("text", fReader, Field.TermVector.WITH_POSITIONS_OFFSETS));
because for what I am doing, the content of the file is what matters most. Titles are too often not descriptive at all (e.g., I have many academic PDF papers whose titles are codes or numbers).
I desperately need to index the top occurring phrases from the text contents; only now do I see how inadequate this simple "bag of words" approach is.
3 Answers
Julia, it seems what you are looking for is n-grams, specifically bigrams (also called collocations).
Here's a chapter about finding collocations (PDF) from Manning and Schütze's Foundations of Statistical Natural Language Processing.
In order to do this with Lucene, I suggest using Solr with the ShingleFilterFactory.
Please see this discussion for details.
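If you are using plain Lucene rather than Solr, the same effect comes from wrapping your token stream in a ShingleFilter from the contrib analyzers. A minimal sketch, assuming Lucene 3.x; the composition of filters below is just one illustrative choice:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Emits word bigrams ("shingles") in addition to single words, so
// "information retrieval" is indexed as one term as well.
public class ShingleAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_36, reader);
        stream = new LowerCaseFilter(Version.LUCENE_36, stream);
        return new ShingleFilter(stream, 2); // maximum shingle size: 2 words
    }
}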
Is it possible for you to post any code that you have written?
Basically, a lot depends on the way you create your fields and store documents in Lucene.
Let's consider a case where I have two fields:
ID and Comments. In my ID field I allow values like 'finding nemo', i.e., strings with spaces, whereas Comments is a free-flowing text field, i.e., I allow anything and everything my keyboard allows and Lucene can understand.
Now, in a real-life scenario it does not make sense to split my ID 'finding nemo' into two different searchable strings, whereas I do want to index everything in Comments.
So what I will do is create a document (org.apache.lucene.document.Document) object to take care of this, something like the sketch below. Essentially, I create two fields: the ID is indexed with Field.Index.NOT_ANALYZED so it stays a single token, and the Comments are indexed with Field.Index.ANALYZED so the text is tokenized.
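A rough sketch of that setup, assuming the Lucene 3.x Field flags (commentsText stands in for whatever free text you have):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
// The ID stays one token, so 'finding nemo' remains a single searchable string
doc.add(new Field("id", "finding nemo", Field.Store.YES, Field.Index.NOT_ANALYZED));
// The comments are tokenized, so every word of the free text is searchable
doc.add(new Field("comments", commentsText, Field.Store.YES, Field.Index.ANALYZED));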
This is how you customize Lucene around the default tokenizer and analyzer; otherwise, you can write your own tokenizer and analyzer.
Link(s)
http://darksleep.com/lucene/
Hope this will help you... :)
Well, the problem of losing phrase context can be solved by using a PhraseQuery.
By default, an index contains positional information for terms, as long as you did not create pure Boolean fields by indexing with the omitTermFreqAndPositions option.
PhraseQuery uses this information to locate documents where terms are within a certain distance of one another.
For example, suppose a field contained the phrase “the quick brown fox jumped over the lazy dog”. Without knowing the exact phrase, you can still find this document by searching for documents with fields having quick and fox near each other. Sure, a plain TermQuery would do the trick to locate this document knowing either of those words, but in this case we only want documents that have phrases where the words are either exactly side by side (quick fox) or have one word in between (quick [irrelevant] fox).
The maximum allowable positional distance between terms to be considered a match is called slop.
Distance is the number of positional moves of terms to reconstruct the phrase in order.
Check out Lucene's JavaDoc for PhraseQuery.
The example below sketches how to work with the query object (an illustration assuming the Lucene 3.x API, searching the question's "text" field):
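import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

// Match "quick" and "fox" in order, with at most one word in between
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("text", "quick"));
phrase.add(new Term("text", "fox"));
phrase.setSlop(1); // 0 = exactly adjacent; 1 also allows "quick [x] fox"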
You can also try to combine various query types with the help of the BooleanQuery class.
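Continuing the sketch above (again assuming Lucene 3.x), a BooleanQuery could, for example, require the phrase and merely prefer an extra term:

import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

BooleanQuery combined = new BooleanQuery();
// Must contain the phrase; documents also mentioning "lucene" score higher
combined.add(phrase, BooleanClause.Occur.MUST);
combined.add(new TermQuery(new Term("text", "lucene")), BooleanClause.Occur.SHOULD);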
And regarding the frequency of phrases, I suppose Lucene's scoring considers the frequency of the terms occurring in the documents.
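Since you indexed your "text" field with term vectors (TermVector.WITH_POSITIONS_OFFSETS), you can also read the raw per-document frequencies back yourself. A sketch assuming the Lucene 3.x API, where reader is an open IndexReader and docId a document number (both hypothetical here):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

// Fetch the stored term vector for one document's "text" field
TermFreqVector vector = reader.getTermFreqVector(docId, "text");
String[] terms = vector.getTerms();         // distinct terms (or shingles)
int[] freqs = vector.getTermFrequencies();  // how often each one occurs
for (int i = 0; i < terms.length; i++) {
    System.out.println(terms[i] + " : " + freqs[i]);
}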