Finding terms for a single field with Lucene (PyLucene)

I'm fairly new to Lucene's term vectors and want to make sure my term gathering is as efficient as it can be.
I'm getting the unique terms and then retrieving each term's docFreq() to perform faceting.

I'm gathering all document terms from the index using:

lindex = SimpleFSDirectory(File(indexdir))
ireader = IndexReader.open(lindex, True)
terms = ireader.terms() #Returns TermEnum

This works fine, but is there a way to return only the terms for a specific field (across all documents)? Wouldn't that be more efficient?

Such as:

 ireader.terms(Field="country")
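
For reference, here is roughly what I am doing today for a single field such as "country". This is only a minimal sketch, assuming PyLucene 3.x on Python 2 and a placeholder index path:

import lucene
lucene.initVM()
from lucene import SimpleFSDirectory, File, IndexReader

indexdir = "/path/to/index"        # placeholder path
lindex = SimpleFSDirectory(File(indexdir))
ireader = IndexReader.open(lindex, True)

facets = {}
terms = ireader.terms()            # TermEnum over every term in every field
while terms.next():                # the no-arg enum starts before its first term
    term = terms.term()
    if term.field() == "country":  # keep only the field I facet on
        facets[term.text()] = ireader.docFreq(term)
terms.close()
print facets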

Comments (1)

毁梦 2025-01-08 10:06:13

IndexReader.terms() also accepts an optional Term argument.
A Term is made up of two parts, the field name and the value, which Lucene calls the "term field" and the "term text".

By passing a Term whose term text is an empty string, we can start the term enumeration at the field we are interested in.

from lucene import SimpleFSDirectory, File, IndexReader, Term

lindex = SimpleFSDirectory(File(indexdir))
ireader = IndexReader.open(lindex, True)
# Position the enumeration at the first term in the "field_name" field.
# Unlike the no-argument terms(), this enum already sits on its first term,
# so read term() before advancing with next().
terms = ireader.terms(Term("field_name", ""))
facets = {'other': 0}
while terms.term() is not None and terms.term().field() == "field_name":
    print "Field Name:", terms.term().field()
    print "Field Value:", terms.term().text()
    print "Matching Docs:", ireader.docFreq(terms.term())
    facets[terms.term().text()] = ireader.docFreq(terms.term())  # facet count
    if not terms.next():
        break
terms.close()

Hopefully others searching for how to perform faceting in PyLucene will come across this post. The key is indexing the terms as-is. Just for completeness, this is how the field values are indexed:

from lucene import (SimpleFSDirectory, File, StandardAnalyzer, Version,
                    IndexWriter, Document, Field)

dir = SimpleFSDirectory(File(indexdir))
analyzer = StandardAnalyzer(Version.LUCENE_30)
writer = IndexWriter(dir, analyzer, True, IndexWriter.MaxFieldLength(512))
print "Currently there are %d documents in the index..." % writer.numDocs()
# "terms" here is the plain list of raw field values to index,
# not the TermEnum used above.
print "Adding %d documents to the index..." % len(terms)
for val in terms:
    doc = Document()
    # Store the value as-is (NOT_ANALYZED) so each value is exactly one term,
    # and keep term vectors.
    doc.add(Field("field_name", val, Field.Store.YES,
                  Field.Index.NOT_ANALYZED, Field.TermVector.YES))
    writer.addDocument(doc)

writer.optimize()
writer.close()
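
To sanity-check the whole flow, here is a small end-to-end sketch under the same assumptions (PyLucene 3.x on Python 2); the index path and the sample country values are made up:

import lucene
lucene.initVM()
from lucene import (SimpleFSDirectory, File, StandardAnalyzer, Version,
                    IndexWriter, Document, Field, IndexReader, Term)

indexdir = "/tmp/country_facets"   # made-up location
directory = SimpleFSDirectory(File(indexdir))

# Index a handful of documents with an as-is (NOT_ANALYZED) "country" field
writer = IndexWriter(directory, StandardAnalyzer(Version.LUCENE_30), True,
                     IndexWriter.MaxFieldLength(512))
for country in ["Canada", "Canada", "France", "Japan"]:
    doc = Document()
    doc.add(Field("country", country, Field.Store.YES,
                  Field.Index.NOT_ANALYZED, Field.TermVector.YES))
    writer.addDocument(doc)
writer.optimize()
writer.close()

# Read the facet counts back: expect Canada 2, France 1, Japan 1
reader = IndexReader.open(directory, True)
term_enum = reader.terms(Term("country", ""))
while term_enum.term() is not None and term_enum.term().field() == "country":
    print term_enum.term().text(), term_enum.docFreq()
    if not term_enum.next():
        break
term_enum.close()
reader.close()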