Splitting a Lucene index without reindexing

Posted on 2024-09-17 12:44:04

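Is there a way to generate separate indexes from a single index, according to some rule, without reindexing the documents?

The original index contains unstored fields, which means I can't read the documents back and add them to a target index.

One option mentioned on SO is to clone the index into several copies and then delete from each copy the documents that don't belong there. I'm looking for a better solution.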

梦屿孤独相伴 2024-09-24 12:44:04

One option mentioned in SO is to clone the index into many and then delete the documents that don't belong to that index. I'm looking for a better solution.

What's wrong with this solution? This strikes me as a very clean solution, involving just a few lines of code.

UPDATE:

Regarding the scenario where you have a 100 GB index and want to split it 500 ways, try this: for every subset of documents that you want to carve out of the index, create hard links to the source index, open the linked index, and delete the documents that don't belong to that subset. If you're on Linux, you can hard-link the files in the index directory with:

cp -lrp myindex myindex.copy

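This can be done as many times as you need, and the links themselves consume almost no disk space, because hard-linked files share the same data on disk. Note that hard links only work within a single filesystem.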
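As a rough sketch of how the linking step could be repeated for many subsets, the loop below wraps the `cp -lrp` command from above. The function name `link_index_copies` and the `.partN` naming scheme are made up for illustration, and it assumes a `cp` that supports `-l` (such as GNU coreutils) with source and copies on the same filesystem. Deleting the unwanted documents from each copy still has to be done afterwards through Lucene.

```shell
# Hypothetical helper: create COUNT hard-linked copies of an index directory.
# The ".partN" suffix is an illustrative naming choice, not a Lucene convention.
link_index_copies() {
    src=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        # -l: hard-link files instead of copying their data
        # -r: recurse into the directory, -p: preserve file attributes
        cp -lrp "$src" "$src.part$i" || return 1
        i=$((i + 1))
    done
}
```

For the scenario above this would be called as `link_index_copies myindex 500`, after which each `myindex.partN` directory can be opened as its own index.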

独自唱情﹋歌 2024-09-24 12:44:04

I found this question first when searching for a solution to my problem, so I will leave my solution here for future generations. In my case, I needed to split my index along specific lines, not arbitrarily down the middle or into thirds or what have you. This is a C# solution using Lucene 3.0.3.

My app's index is over 300GB in size, which was becoming a little unmanageable. Each document in the index is associated with one of the manufacturing plants that uses the app. There is no business reason that one plant would ever search for another plant's data, so I needed to cleanly divide the index along those lines. Here's the code I wrote to do so:

var distinctPlantIDs = databaseRepo.GetDistinctPlantIDs();
var sourceDir = GetOldIndexDir();
foreach (var plantID in distinctPlantIDs)
{
    var query = new TermQuery(new Term("PlantID", plantID.ToString()));
    var targetDir = GetNewIndexDirForPlant(plantID); //returns a unique directory where this plant's index will go

    //read each plant's documents and write them to the new index
    using (var analyzer = new StandardAnalyzer(Version.LUCENE_30, CharArraySet.EMPTY_SET))
    using (var sourceSearcher = new IndexSearcher(sourceDir, true))
    using (var destWriter = new IndexWriter(targetDir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        var numHits = sourceSearcher.DocFreq(query.Term);
        if (numHits <= 0) continue;
        var hits = sourceSearcher.Search(query, numHits).ScoreDocs;
        foreach (var hit in hits)
        {
            // Doc() loads stored fields only; fields that were indexed but not stored are not copied
            var doc = sourceSearcher.Doc(hit.Doc);
            destWriter.AddDocument(doc);
        }
        destWriter.Optimize();
        destWriter.Commit();
    }

    //delete the documents out of the old index
    using (var analyzer = new StandardAnalyzer(Version.LUCENE_30, CharArraySet.EMPTY_SET))
    using (var sourceWriter = new IndexWriter(sourceDir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        sourceWriter.DeleteDocuments(query);
        sourceWriter.Commit();
    }
}

The part that deletes the records from the old index is there because, in my case, one plant's records took up the majority of the index (over two-thirds). So in my real version there is some extra code to handle that plant last: instead of splitting it out like the others, it optimizes the remaining index (which by then contains only that plant) and then moves it to its new directory.

Anyway, hope this helps someone out there.
