I have a data set with multiple layers of annotation over the underlying text, such as part-of-speech tags, chunks from a shallow parser, named entities, and others from various natural language processing (NLP) tools. For a sentence like The man went to the store, the annotations might look like:
Word POS Chunk NER
==== === ===== ========
The DT NP Person
man NN NP Person
went VBD VP -
to TO PP -
the DT NP Location
store NN NP Location
I'd like to index a bunch of documents with annotations like these using Lucene and then perform searches across the different layers. An example of a simple query would be to retrieve all documents where Washington is tagged as a person. While I'm not absolutely committed to the notation, syntactically end-users might enter the query as follows:
Query: Word=Washington,NER=Person
I'd also like to do more complex queries involving the sequential order of annotations across different layers, e.g. find all the documents where there's a word tagged person, followed by the words arrived at, followed by a word tagged location. Such a query might look like:
Query: "NER=Person Word=arrived Word=at NER=Location"
What's a good way to go about approaching this with Lucene? Is there any way to index and search over document fields that contain structured tokens?
Payloads
One suggestion was to try to use Lucene payloads. However, I thought payloads could only be used to adjust the ranking of documents, and that they aren't used to select which documents are returned.
The latter is important since, for some use-cases, the number of documents that contain a pattern is really what I want.
Also, only the payloads on terms that match the query are examined. This means that payloads could at best help with the ranking of the first example query, Word=Washington,NER=Person, where we just want to make sure the term Washington is tagged as a Person. However, for the second example query, "NER=Person Word=arrived Word=at NER=Location", I need to check the tags on unspecified, and thus non-matching, terms.
Perhaps one way to achieve what you're asking is to index each class of annotation at the same position (i.e., Word, POS, Chunk, NER) and prefix each of the annotations with a unique string. Don't bother with prefixes for words. You will need a custom Analyzer to preserve the prefixes, but then you should be able to use the syntax you want for queries.
To be specific, what I am proposing is that you index the following tokens at the specified positions (shown here for the example sentence from the question, with words left unprefixed and each annotation class given its own prefix):
Position Tokens at that position
======== =====================================
1        The    POS=DT   Chunk=NP  NER=Person
2        man    POS=NN   Chunk=NP  NER=Person
3        went   POS=VBD  Chunk=VP
4        to     POS=TO   Chunk=PP
5        the    POS=DT   Chunk=NP  NER=Location
6        store  POS=NN   Chunk=NP  NER=Location
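For illustration, here is one way the stacked tokens could be produced: a hand-rolled TokenStream that gives annotation tokens a position increment of 0 so they land on the same position as the word they describe. This is an untested sketch against recent Lucene (5.x-8.x style) APIs; the class name AnnotationIndexing, the field name "text", and the exact prefix spelling are my own placeholders, not part of the original answer.

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class AnnotationIndexing {

    /** Emits every token listed for a word position; the first token advances the
     *  position by 1, the remaining annotation tokens are stacked at the same
     *  position (increment 0). */
    static final class AnnotationTokenStream extends TokenStream {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
        private final String[][] tokensPerPosition;
        private int pos = 0, idx = 0;

        AnnotationTokenStream(String[][] tokensPerPosition) {
            this.tokensPerPosition = tokensPerPosition;
        }

        @Override
        public boolean incrementToken() {
            if (pos >= tokensPerPosition.length) {
                return false;
            }
            clearAttributes();
            termAtt.setEmpty().append(tokensPerPosition[pos][idx]);
            posIncAtt.setPositionIncrement(idx == 0 ? 1 : 0); // 0 = same position as the word
            if (++idx == tokensPerPosition[pos].length) {
                pos++;
                idx = 0;
            }
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            pos = 0;
            idx = 0;
        }
    }

    /** Indexes the example sentence from the question as a single document. */
    static void indexExampleSentence(IndexWriter writer) throws IOException {
        String[][] tokens = {
            { "The",   "POS=DT",  "Chunk=NP", "NER=Person" },
            { "man",   "POS=NN",  "Chunk=NP", "NER=Person" },
            { "went",  "POS=VBD", "Chunk=VP" },
            { "to",    "POS=TO",  "Chunk=PP" },
            { "the",   "POS=DT",  "Chunk=NP", "NER=Location" },
            { "store", "POS=NN",  "Chunk=NP", "NER=Location" },
        };
        Document doc = new Document();
        // The field is indexed directly from the TokenStream, so no analyzer runs over it.
        doc.add(new Field("text", new AnnotationTokenStream(tokens), TextField.TYPE_NOT_STORED));
        writer.addDocument(doc);
    }
}

The same effect can also be achieved inside the custom Analyzer mentioned above; all that matters is that every annotation token ends up at the position of the word it annotates.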
To get the semantics, use a SpanQuery (e.g., a SpanNearQuery built from SpanTermQuerys) to preserve token sequence.
I haven't tried this, but indexing the different classes of terms at the same position should allow position-sensitive queries to do the right thing when evaluating expressions such as:
"NER=Person arrived at NER=Location"
Note the difference from your example: I deleted the Word= prefix to treat that as the default. Also, your choice of prefix syntax (e.g., "class=") may constrain the contents of the documents you are indexing: make sure the documents either don't contain such strings or that you escape them in some way in pre-processing. This is, of course, related to the analyzer you'll need to use.
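To sketch the query side under the same assumptions (field name "text" and the prefixes from the indexing sketch above; Lucene 8.x package names, with the span classes living in org.apache.lucene.queries.spans from 9.0 on), the second example query becomes an ordered SpanNearQuery, and IndexSearcher.count then reports how many documents contain the pattern, which is the counting use-case mentioned in the question:

import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.Directory;

public class AnnotationSearch {

    /** Counts documents matching: a word tagged Person, then "arrived", then "at",
     *  then a word tagged Location, in that order with no gaps in between. */
    static int countPersonArrivedAtLocation(Directory dir) throws IOException {
        SpanQuery pattern = new SpanNearQuery(new SpanQuery[] {
            new SpanTermQuery(new Term("text", "NER=Person")),
            new SpanTermQuery(new Term("text", "arrived")),
            new SpanTermQuery(new Term("text", "at")),
            new SpanTermQuery(new Term("text", "NER=Location"))
        }, 0, true); // slop 0, in order

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            return searcher.count(pattern); // number of documents containing the pattern
        }
    }
}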
Update: I used this technique for indexing sentence and paragraph boundaries in text (using break=sen and break=para tokens) so that I could decide where to break phrase query matches. Seems to work just fine.
What you are looking for are payloads. Lucid Imagination has a detailed blog entry on the subject. Payloads allow you to store a byte array of metadata about individual terms. Once you have indexed your data with the payloads included, you can create a new similarity mechanism that takes your payloads into account when scoring.
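For reference, a minimal sketch of the indexing side of the payload approach (my own illustration, not taken from the blog entry; it assumes the lucene-analyzers-common module and input text that has been pre-formatted so each token carries its tag after a '|' delimiter):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.IdentityEncoder;

public class PayloadAnalyzerSketch {

    /** For input prepared as "Washington|Person arrived|- at|- ...", the part after
     *  the '|' is stored as a byte-array payload on each indexed term. */
    static Analyzer payloadAnalyzer() {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                WhitespaceTokenizer source = new WhitespaceTokenizer();
                TokenStream sink = new DelimitedPayloadTokenFilter(source, '|', new IdentityEncoder());
                return new TokenStreamComponents(source, sink);
            }
        };
    }
    // Reading those payloads back for scoring requires a payload-aware query or Similarity;
    // the exact classes for that have changed across Lucene versions.
}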
You can indeed search for patterns of text in Lucene using SpanQuery: adjust the slop distance to limit how far apart the query terms can occur, and even constrain the order in which they appear.