Lucene index files keep changing even without any add, update, or delete operations

Published 2025-01-18 11:52:37

I have noticed that my Lucene index segment files (file names) are constantly changing, even when I am not performing any add, update, or delete operations. The only operations I am performing are reading and searching. So my question is: do Lucene index segment files get updated internally somehow just from reading and searching operations?

I am using Lucene.Net v4.8 beta, if that matters. Thanks!

Here is an example of how I found this issue (I wanted to get the index size). Assuming a Lucene Index already exists, I used the following code to get the index size:

Example:

private long GetIndexSize()
{
    var reader = GetDirectoryReader("validPath");
    long size = 0;

    foreach (var fileName in reader.Directory.ListAll())
    {
        size += reader.Directory.FileLength(fileName);
    }

    return size;
}

private DirectoryReader GetDirectoryReader(string path)
{
    var directory = FSDirectory.Open(path);
    var reader = DirectoryReader.Open(directory);
    return reader;
}

The above method is called every 5 minutes. It works fine ~98% of the time. However, the other 2% of the time, I get a file-not-found error in the foreach loop, and after debugging, I saw that the count of files in reader.Directory was changing. The index is updated at certain times by another service, but I can assure you that no updates were made to the index anywhere near the times when this error occurs.

Comments (1)

喜你已久 2025-01-25 11:52:37

Since you have multiple processes writing/reading the same set of files, it is difficult to isolate what is happening. Lucene.NET does locking and exception handling to ensure operations can be synced up between processes, but if you read the files in the directory directly without doing any locking, you need to be prepared to deal with IOExceptions.

The solution depends on how up to date you need the index size to be:

  1. If it is okay to be a bit out of date, I would suggest using DirectoryInfo.EnumerateFiles on the directory itself. This may be a bit more up to date than Directory.ListAll() because that method stores the file names in an array, which may go stale before the loop is done. But you still need to catch FileNotFoundException and ignore it, and possibly deal with other IOExceptions.
  2. If you need the size to be absolutely up to date and plan to do an operation that requires the index to be that size, you need to open a write lock to prevent the files from changing while you get the value.

private long GetIndexSize()
{
    // DirectoryReader is superfluous for this example. Also,
    // using a MMapDirectory (which DirectoryReader.Open() may return)
    // will use more RAM than simply using SimpleFSDirectory.
    var directory = new SimpleFSDirectory("validPath");
    long size = 0;

    // NOTE: The lock will stay active until this is disposed,
    // so if you have any follow-on actions to perform, the lock
    // should be obtained before calling this method and disposed
    // after you have completed all of your operations.
    using Lock writeLock = directory.MakeLock(IndexWriter.WRITE_LOCK_NAME);

    // Obtain exclusive write access to the directory
    if (!writeLock.Obtain(/* optional timeout */))
    {
         // timeout failed, either throw an exception or retry...
    }

    foreach (var fileName in directory.ListAll())
    {
        size += directory.FileLength(fileName);
    }

    return size;
}
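For option 1 above, a minimal sketch using DirectoryInfo.EnumerateFiles might look like the following. The method name and path parameter are illustrative, and the catch blocks implement the "catch and ignore" advice from the answer:

```csharp
using System.IO;

private long GetIndexSizeApproximate(string indexPath)
{
    long size = 0;

    // EnumerateFiles streams entries lazily rather than snapshotting
    // them into an array, so it tends to be slightly fresher.
    foreach (FileInfo file in new DirectoryInfo(indexPath).EnumerateFiles())
    {
        try
        {
            size += file.Length;
        }
        catch (FileNotFoundException)
        {
            // The file was deleted between enumeration and the size read
            // (e.g. by a segment merge in the writing process); skip it.
        }
        catch (IOException)
        {
            // Other transient IO failure; skip or log as appropriate.
        }
    }

    return size;
}
```

Note that the value is only a point-in-time approximation; files can still appear or disappear while the loop runs.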

Of course, if you go that route, your IndexWriter may throw a LockObtainFailedException, and you should be prepared to handle it during the write process.

However you deal with it, you need to be catching and handling exceptions because IO by its nature has many things that can go wrong. But exactly how you deal with it depends on what your priorities are.

Original Answer

If you have an IndexWriter instance open, Lucene.NET will run a background process to merge segments based on the MergePolicy being used. The default settings can be used with most applications.

However, the settings are configurable through the IndexWriterConfig.MergePolicy property. By default, it uses the TieredMergePolicy.

var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
{
    MergePolicy = new TieredMergePolicy()
};

There are several properties on TieredMergePolicy that can be used to change the thresholds that it uses to merge.
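As a sketch of tuning those thresholds (the property names follow the Lucene.NET 4.8 port of the Java setters; the values here are arbitrary examples, so verify both against the version you are using):

```csharp
var mergePolicy = new TieredMergePolicy
{
    // Maximum number of segments merged at once during normal merging.
    MaxMergeAtOnce = 10,
    // Segments larger than this (in MB) are not merged further.
    MaxMergedSegmentMB = 5 * 1024,
    // Allowed segment count per tier; smaller values mean more merging.
    SegmentsPerTier = 10.0
};

var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
{
    MergePolicy = mergePolicy
};
```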

Or, it can be changed to a different MergePolicy implementation that ships with Lucene.NET. For example, the NoMergePolicy class can be used to disable merging entirely.

If your application never needs to add documents to the index (for example, if the index is built as part of the application deployment), it is also possible to use an IndexReader from a Directory instance directly, which does not do any background segment merges.
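A sketch of that read-only pattern, reusing the "validPath" placeholder from the question:

```csharp
// Search-only usage: no IndexWriter is ever opened, so no background
// segment merges run and the files on disk remain stable.
using var directory = FSDirectory.Open("validPath");
using var reader = DirectoryReader.Open(directory);
var searcher = new IndexSearcher(reader);
// ... run searcher.Search(...) queries as needed ...
```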

The merge scheduler can also be swapped and/or configured using the IndexWriterConfig.MergeScheduler property. By default, it uses the ConcurrentMergeScheduler.

var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
{
    MergePolicy = new TieredMergePolicy(),
    MergeScheduler = new ConcurrentMergeScheduler()
};

Among the merge schedulers included with Lucene.NET 4.8.0, the NoMergeScheduler class can be used to disable merging entirely. This has the same effect as using NoMergePolicy, but also prevents any scheduling code from being executed.
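As a sketch of disabling merging entirely, assuming the 4.8 API where NoMergePolicy and NoMergeScheduler expose static instances rather than public constructors (check the member names against your version):

```csharp
var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
{
    // Disable merge selection entirely...
    MergePolicy = NoMergePolicy.COMPOUND_FILES,
    // ...and also skip the merge-scheduling machinery.
    MergeScheduler = NoMergeScheduler.INSTANCE
};
```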
