Lucene index files keep changing even when no add, update, or delete operations are performed
I have noticed that my Lucene index segment files (file names) are constantly changing, even when I am not performing any add, update, or delete operations. The only operations I am performing are reading and searching. So my question is: do Lucene index segment files get updated internally somehow just from reading and searching operations?
I am using Lucene.Net v4.8 beta, if that matters. Thanks!
Here is an example of how I found this issue (I wanted to get the index size). Assuming a Lucene Index already exists, I used the following code to get the index size:
Example:
private long GetIndexSize()
{
    var reader = GetDirectoryReader("validPath");
    long size = 0;
    foreach (var fileName in reader.Directory.ListAll())
    {
        size += reader.Directory.FileLength(fileName);
    }
    return size;
}

private DirectoryReader GetDirectoryReader(string path)
{
    var directory = FSDirectory.Open(path);
    var reader = DirectoryReader.Open(directory);
    return reader;
}
The above method is called every 5 minutes. It works fine ~98% of the time. However, the other 2% of the time, I get a "file not found" error in the foreach loop, and after debugging, I saw that the number of files in reader.Directory is changing. The index is updated at certain times by another service, but I can assure you that no updates were made to the index anywhere near the times when this error occurs.
Comments (1)
Since you have multiple processes writing/reading the same set of files, it is difficult to isolate what is happening. Lucene.NET does locking and exception handling to ensure operations can be synced up between processes, but if you read the files in the directory directly without doing any locking, you need to be prepared to deal with IOExceptions.
The solution depends on how up to date you need the index size to be:
[...] Directory.ListAll() because that method stores the file names in an array, which may go stale before the loop is done. But you still need to catch FileNotFoundException and ignore it, and possibly deal with other IOExceptions.
Of course, if you go that route, your IndexWriter may throw a LockObtainFailedException, and you should be prepared to handle it during the write process.
However you deal with it, you need to be catching and handling exceptions, because IO by its nature has many things that can go wrong. But exactly how you deal with it depends on what your priorities are.
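As a rough illustration of the "catch FileNotFoundException and ignore it" advice, here is a minimal sketch that reworks the size calculation from the question. The class name IndexSizeProbe is hypothetical, and the choice to open the FSDirectory directly (without a DirectoryReader) is an assumption made to keep the example small; it does no locking, so the result is only a best-effort snapshot.

```csharp
using System.IO;
using Lucene.Net.Store;

// Hypothetical helper, not part of Lucene.NET.
public static class IndexSizeProbe
{
    // Sums the lengths of the files currently in the index directory.
    // Files that disappear between listing and measuring are skipped.
    public static long GetIndexSize(string indexPath)
    {
        long size = 0;
        using (FSDirectory directory = FSDirectory.Open(indexPath))
        {
            // ListAll() returns an array snapshot, so it may already be
            // stale by the time a given file is measured.
            foreach (string fileName in directory.ListAll())
            {
                try
                {
                    size += directory.FileLength(fileName);
                }
                catch (FileNotFoundException)
                {
                    // The file was removed after it was listed
                    // (for example by a merge in another process); ignore it.
                }
            }
        }
        return size;
    }
}
```

Other IOExceptions (for example, a missing index directory) are deliberately left to propagate here so the caller can decide how to handle them.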
Original Answer
If you have an IndexWriter instance open, Lucene.NET will run a background process to merge segments based on the MergePolicy being used. The default settings can be used with most applications.
However, the settings are configurable through the IndexWriterConfig.MergePolicy property. By default, it uses the TieredMergePolicy.
There are several properties on TieredMergePolicy that can be used to change the thresholds that it uses to merge.
Or, it can be changed to a different MergePolicy implementation. Lucene.NET comes with:
The NoMergePolicy class can be used to disable merging entirely.
The merge scheduler can also be swapped and/or configured using the IndexWriterConfig.MergeScheduler property. By default, it uses the ConcurrentMergeScheduler.
The merge schedulers that are included with Lucene.NET 4.8.0 are:
The NoMergeScheduler class can be used to disable merging entirely. This has the same effect as using NoMergePolicy, but also prevents any scheduling code from being executed.
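The configuration described above could look roughly like the following sketch. The class name MergeConfigSketch, the use of StandardAnalyzer (which assumes the Lucene.Net.Analysis.Common package is referenced), and the threshold values are illustrative assumptions, not recommendations.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Util;

// Hypothetical helper, not part of Lucene.NET.
public static class MergeConfigSketch
{
    private const LuceneVersion Version = LuceneVersion.LUCENE_48;

    // Keeps the default TieredMergePolicy but changes two of its thresholds.
    // The values are arbitrary examples.
    public static IndexWriterConfig CreateTunedConfig()
    {
        var policy = new TieredMergePolicy
        {
            MaxMergeAtOnce = 5,
            SegmentsPerTier = 20.0
        };
        return new IndexWriterConfig(Version, new StandardAnalyzer(Version))
        {
            MergePolicy = policy
        };
    }

    // Disables merging entirely, at both the policy and scheduler level.
    public static IndexWriterConfig CreateNoMergeConfig()
    {
        return new IndexWriterConfig(Version, new StandardAnalyzer(Version))
        {
            MergePolicy = NoMergePolicy.COMPOUND_FILES,
            MergeScheduler = NoMergeScheduler.INSTANCE
        };
    }
}
```

Either config would be passed to the IndexWriter constructor in the process that writes the index; it has no effect on processes that only read.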