Effect of document size in Lucene

Posted on 2024-12-28 18:10:44

I have just started reading up on Lucene. In one of the examples provided, an entire file was added to a Document before the Document was added to the index.

However, the documentation suggested that this indexing technique would not give good performance. The recommended way is to store each line of the file in a separate Document.

I was curious to know how this helps to improve indexing performance.

Also, I wanted to validate my understanding: to add every line of a file as a Document field, we first have to tokenize the line to obtain the tokens and then create a field from them.

Comments (1)

意犹 2025-01-04 18:10:44

Even if you don't take performance into account, these two approaches won't yield the same results. If you have a single file whose first line is "fox" and second line is "dog", and you search for "fox" AND "dog", there will be no results with the second approach, because each line is indexed as its own document and no single document contains both terms.

Regarding your second question: no, you don't need to perform any tokenization before creating documents and fields. Tokenization is performed by the IndexWriter's analyzer when you call IndexWriter#addDocument(Document).
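To make the two points above concrete, here is a minimal sketch in plain Java. It is a toy inverted index, not Lucene's API: the class ToyIndex and its methods addDocument and searchAll are hypothetical names invented for this illustration. It assumes a trivial lowercase, split-on-non-word-characters tokenizer and shows (a) that the caller hands over raw text and tokenization happens inside the add call, and (b) why an AND query matches the whole-file document but nothing when each line is its own document.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index for illustration only; this is NOT Lucene's API. */
public class ToyIndex {
    // term -> ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Accepts raw text; tokenization happens here, at add time. */
    public int addDocument(String rawText) {
        int docId = nextDocId++;
        // Assumed tokenizer: lowercase, then split on non-word characters.
        for (String token : rawText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Boolean AND: ids of the documents that contain every given term. */
    public Set<Integer> searchAll(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs =
                postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term: start from its postings
            } else {
                result.retainAll(docs);         // later terms: intersect
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        String file = "fox\ndog";

        // Approach 1: the whole file is one document.
        ToyIndex wholeFile = new ToyIndex();
        wholeFile.addDocument(file);

        // Approach 2: every line is its own document.
        ToyIndex perLine = new ToyIndex();
        for (String line : file.split("\n")) {
            perLine.addDocument(line);
        }

        System.out.println("whole file: " + wholeFile.searchAll("fox", "dog")); // [0]
        System.out.println("per line:   " + perLine.searchAll("fox", "dog"));   // []
    }
}
```

In the per-line index, "fox" lives in document 0 and "dog" in document 1, so the intersection is empty. A real Lucene index behaves the same way for a boolean AND query, although its analysis chain and data structures are far more elaborate than this sketch.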

If you are getting started with Lucene, I highly recommend you read the demo code. This will show you how to create and then search a Lucene index.

And if indexing speed is critical for the application you are developing, there is very good advice on the Lucene wiki.
