Re-indexing / deleting terms from an index with Lucene
I hope you can help me. Here is my problem:
Edit: Now that I have thought about it again, being able to delete a term from the index would solve this anyway. Is there a way to do that? If there is, there is no need to read the rest of the question. Thanks!
Here is what I intend to do:
1 - I have to index some files while removing the standard stopwords.
2 - Afterwards, I must count the document frequency (df) of every term and remove the terms that have df < 2.
How I'm doing it:
1 - I index the files using IndexWriter, while removing the standard stopwords.
2 - I count the df of every term and add the terms with df < 2 to the stopword list.
3 - Then I index the texts again using IndexWriter, this time with the new stopword list.
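For reference, step 2 can be sketched roughly as follows. This is only a plain-collections model of the counting, not Lucene code: the class `DocFreqSketch` and its methods are hypothetical names, and each document is assumed to be an already tokenized list of terms with the standard stopwords removed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-collections model of step 2; no Lucene classes are used here.
class DocFreqSketch {

    // Document frequency: the number of documents that contain each term at least once.
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : docs) {
            // The set ensures a term repeated inside one document is counted once.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Terms whose df is below minDf; these are the candidates for the stopword list.
    static Set<String> rareTerms(Map<String, Integer> df, int minDf) {
        Set<String> rare = new TreeSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            if (e.getValue() < minDf) {
                rare.add(e.getKey());
            }
        }
        return rare;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "index", "index"),
                Arrays.asList("lucene", "term"),
                Arrays.asList("index", "writer"));
        Map<String, Integer> df = documentFrequencies(docs);
        System.out.println(df);               // prints {index=2, lucene=2, term=1, writer=1}
        System.out.println(rareTerms(df, 2)); // prints [term, writer]
    }
}
```

Note that "index" occurs twice in the first document but still contributes only 1 to its df.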
What's really happening:
The first time I index, everything goes as planned. The problem appears when I index for the second time, and the result becomes pretty unpredictable:
1) If I run the program once, only the standard stopwords are removed, even though the stopword list contains the new words.
2) If I run the program a second time, then the terms with df < 2 are removed.
I print the terms in the index twice: once after indexing for the first time, and once after indexing for the second time.
When I run the program a second time, the terms with df < 2 already appear removed in the first print. They shouldn't, because I only add the terms with df < 2 to the stopword list when indexing for the second time.
Maybe my explanation was a bit confusing; please tell me if anything is unclear.
I hope you guys can help me.
Thank you very much!
Comments (1)
When indexing documents for the second time, make sure to delete the first instance of each document; otherwise you will inflate the dfs for all terms. You can delete documents by an external id field: create a Term with field=idfield and value=externalId, then use deleteDocuments(Term) of IndexWriter to remove the old instance, and then add the new one. I don't think there is a way to delete terms explicitly; they are derived from the documents.
As an optimization, you might consider the following:
1. Index all documents
2. Find all terms with df = 1
3. Remove all documents with each such term, keeping track of their external document ids
4. Add the terms to your stop list
5. Re-index only the previously-removed documents.
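The five steps above can be sketched roughly like this. It is a minimal in-memory model, not Lucene code: `SelectiveReindexSketch`, `reindex`, and `pruneRareTerms` are hypothetical names, and a map keyed by the external id stands in for the index, so `index.remove(id)` plays the role of deleting the old instance by its id term.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory stand-in for an index keyed by external id; this is not the Lucene API.
class SelectiveReindexSketch {
    // external id -> terms kept for that document after stopword filtering
    final Map<String, Set<String>> index = new TreeMap<>();

    // Delete any old instance with this id, then add the new one.
    void reindex(String id, List<String> tokens, Set<String> stopwords) {
        index.remove(id);
        Set<String> terms = new TreeSet<>(tokens);
        terms.removeAll(stopwords);
        index.put(id, terms);
    }

    // Number of indexed documents containing each term.
    Map<String, Integer> documentFrequencies() {
        Map<String, Integer> df = new TreeMap<>();
        for (Set<String> terms : index.values()) {
            for (String term : terms) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Steps 2-5: find the df = 1 terms, add them to the stop list, and
    // re-index only the documents that contained one of them.
    // Returns the external ids of the documents that were re-indexed.
    Set<String> pruneRareTerms(Map<String, List<String>> source, Set<String> stopwords) {
        Set<String> rare = new TreeSet<>();
        for (Map.Entry<String, Integer> e : documentFrequencies().entrySet()) {
            if (e.getValue() == 1) {
                rare.add(e.getKey());
            }
        }
        Set<String> affected = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : index.entrySet()) {
            if (!Collections.disjoint(e.getValue(), rare)) {
                affected.add(e.getKey());
            }
        }
        stopwords.addAll(rare);
        for (String id : affected) {
            reindex(id, source.get(id), stopwords);
        }
        return affected;
    }

    public static void main(String[] args) {
        Map<String, List<String>> source = new LinkedHashMap<>();
        source.put("d1", Arrays.asList("the", "lucene", "index"));
        source.put("d2", Arrays.asList("lucene", "writer"));
        source.put("d3", Arrays.asList("index", "lucene"));
        Set<String> stopwords = new HashSet<>(Arrays.asList("the"));

        SelectiveReindexSketch idx = new SelectiveReindexSketch();
        for (Map.Entry<String, List<String>> e : source.entrySet()) {
            idx.reindex(e.getKey(), e.getValue(), stopwords); // step 1
        }
        System.out.println(idx.pruneRareTerms(source, stopwords)); // prints [d2]
        System.out.println(idx.documentFrequencies());             // prints {index=2, lucene=3}
    }
}
```

Only d2 is touched, because it is the only document containing a df = 1 term. In a real index, as far as I know, IndexWriter's updateDocument(Term, ...) combines the delete-by-term and the add in one call, which may be a convenient way to do the per-document re-index.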
Of course, you first have to think carefully about the use case for removing these terms.
1. Why does it matter if they occur in the index?
2. What happens if you update the index later and add a new document that causes some term that previously had df=1 to now have df=2? You would not be able to index on that term, since it would already be in the stop list.
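The second point can be illustrated with a tiny sketch. Again this is a plain-collections model rather than Lucene code; `StopListPitfallSketch` and `indexedTerms` are hypothetical names, and the stop list is assumed to already contain a term ("writer") that had df=1 after the first batch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Plain-collections illustration of the stop-list pitfall; not Lucene code.
class StopListPitfallSketch {

    // Terms of a document that survive the stop list (a stand-in for analysis-time filtering).
    static Set<String> indexedTerms(List<String> tokens, Set<String> stopwords) {
        Set<String> terms = new TreeSet<>(tokens);
        terms.removeAll(stopwords);
        return terms;
    }

    public static void main(String[] args) {
        // "writer" had df = 1 in the first batch, so it was added to the stop list.
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "writer"));

        // A later document also contains "writer". Across both documents its df
        // would now be 2, but the stop list drops it before it reaches the index.
        List<String> laterDoc = Arrays.asList("the", "writer", "api");
        System.out.println(indexedTerms(laterDoc, stopwords)); // prints [api]
    }
}
```

If that matters for your use case, the stop list would have to be recomputed from the unfiltered documents whenever the collection changes.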