Lucene: how are index files loaded while searching?



Can anyone explain how index files are loaded into memory while searching?

Is the whole file (fnm, tis, fdt, etc.) loaded at once, or in chunks?

How are individual segments loaded, and in which order?

How can a Lucene index be encrypted?


Comments (1)

画离情绘悲伤 2025-01-10 21:24:13


The main point of having index segments is that you can rarely load the whole index into memory.

The most important limitation taken into account when designing the index format is that disk seek time is relatively long (on spinning-platter hard drives, which are still widely used). A good estimate is that the transfer time per byte is about 0.01 to 0.02 μs, while the average seek time of the disk head is about 5 ms! In other words, a single seek costs roughly as much as sequentially transferring 250–500 KB of data.

So the part that is kept in memory is typically only the dictionary, used to find the starting block of the postings list on disk*. The other parts are loaded only on demand and then purged from memory to make room for other searches.
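As a rough sketch of what this looks like in practice (the index path and the "title" field below are made-up examples, and details vary between Lucene versions), opening the index with MMapDirectory lets the operating system page index blocks in on demand instead of reading whole files up front:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class OnDemandSearch {
    public static void main(String[] args) throws Exception {
        // MMapDirectory maps the index files into virtual memory; the OS pages
        // in only the blocks that the search actually touches.
        try (Directory dir = new MMapDirectory(Paths.get("/path/to/index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Looking up title:lucene consults the in-memory terms index to
            // locate the on-disk block for the term, then reads its postings.
            TopDocs hits = searcher.search(new TermQuery(new Term("title", "lucene")), 10);
            System.out.println("total hits: " + hits.totalHits);
        }
    }
}
```

Only the small terms index ends up on the JVM heap; postings and stored fields stay on disk (or in the OS page cache) until a query actually touches them.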

As for encryption, it depends on whether you need to keep the index encrypted at all times (even in memory) or whether it suffices to encrypt only the index files. For the latter, an encrypted file system should be enough. The former is certainly possible too, since various index compression techniques are already in place; however, I don't think it is widely used, as the first and foremost requirement for a full-text engine is speed.
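If the index really had to stay encrypted at rest independently of the file system, one place to hook that in would be Lucene's Directory abstraction. The skeleton below is only a sketch of that idea (the class name is invented and the actual encrypting/decrypting streams are left as TODOs), not a working implementation:

```java
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

// Conceptual skeleton: a Directory wrapper is where application-level
// encryption could hook in. A real implementation must return IndexInput /
// IndexOutput wrappers that decrypt/encrypt the bytes passing through them.
public class EncryptedDirectory extends FilterDirectory {
    public EncryptedDirectory(Directory in) {
        super(in);
    }

    @Override
    public IndexInput openInput(String name, IOContext context) throws IOException {
        IndexInput raw = in.openInput(name, context);
        // TODO: wrap 'raw' in a decrypting IndexInput; returned as-is here.
        return raw;
    }

    @Override
    public IndexOutput createOutput(String name, IOContext context) throws IOException {
        IndexOutput raw = in.createOutput(name, context);
        // TODO: wrap 'raw' in an encrypting IndexOutput; returned as-is here.
        return raw;
    }
}
```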

[*] It isn't really that simple, because we perform binary searches against the dictionary, so we need all entries in the first structure to have equal length. Since that is clearly not the case for normal dictionary words, and padding would be too costly (think of the word lengths of some chemical substances), we actually maintain two levels of dictionary: the first one (which needs to fit in memory and is stored in the .tii file) keeps a sorted list of the starting positions of terms in the second index (the .tis file). The second index is a concatenated array of all terms in increasing order, each with a pointer into the .frq file. The second index often fits in memory and is loaded at the start, but that can be impossible, e.g. for bigram indexes. Also note that for some time Lucene has, by default, used so-called compound files (with the .cfs extension) instead of individual files, to cut down the number of open files.
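Here is a toy illustration of that two-level lookup (the data, interval, and class names are invented for the example and have nothing to do with Lucene's real file formats): the small first level is binary-searched in memory, and only a short block of the second level is then scanned:

```java
import java.util.Arrays;

// Level 1 (the ".tii"-like index) holds every Nth term plus its position in
// level 2; level 2 (the ".tis"-like list) holds all terms in sorted order,
// each with a pointer into the postings (".frq"-like) file.
public class TwoLevelDictionary {
    static final String[] allTerms =      // level 2: every term, sorted
        {"apple", "banana", "cherry", "date", "fig", "grape", "kiwi", "lemon"};
    static final long[] postingsOffsets = // byte offset of each term's postings
        {0, 120, 340, 410, 900, 1220, 1500, 1780};

    static final int INTERVAL = 4;        // keep every 4th term in level 1
    static final String[] level1Terms = {"apple", "fig"};
    static final int[] level1Positions = {0, 4};

    static long lookup(String term) {
        // Binary-search the small in-memory level 1 to pick the block to scan.
        int idx = Arrays.binarySearch(level1Terms, term);
        int slot = idx >= 0 ? idx : -idx - 2;
        if (slot < 0) {
            return -1;                    // smaller than every indexed term
        }
        int start = level1Positions[slot];
        // Scan level 2 sequentially from the block start (on disk in real Lucene).
        for (int i = start; i < allTerms.length && i < start + INTERVAL; i++) {
            if (allTerms[i].equals(term)) {
                return postingsOffsets[i]; // where this term's postings begin
            }
        }
        return -1;                         // term not in the dictionary
    }

    public static void main(String[] args) {
        System.out.println(lookup("date"));  // prints 410
        System.out.println(lookup("mango")); // prints -1
    }
}
```

The trade-off is the same one described above: the first level stays tiny because it only keeps every Nth term, so it can always live in memory, while the second level may grow arbitrarily large without bloating the heap.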
