I have a data set with multiple layers of annotation over the underlying text, such as part-of-speech tags, chunks from a shallow parser, named entities, and others from various natural language processing (NLP) tools. For a sentence like The man went to the store, the annotations might look like:
Word POS Chunk NER
==== === ===== ========
The DT NP Person
man NN NP Person
went VBD VP -
to TO PP -
the DT NP Location
store NN NP Location
I'd like to index a bunch of documents with annotations like these using Lucene and then perform searches across the different layers. An example of a simple query would be to retrieve all documents where Washington is tagged as a person. While I'm not absolutely committed to the notation, syntactically end-users might enter the query as follows:
Query: Word=Washington,NER=Person
I'd also like to do more complex queries involving the sequential order of annotations across different layers, e.g. find all the documents where there's a word tagged person, followed by the words arrived at, followed by a word tagged location. Such a query might look like:
Query: "NER=Person Word=arrived Word=at NER=Location"
What's a good way to go about approaching this with Lucene? Is there any way to index and search over document fields that contain structured tokens?
Payloads
One suggestion was to try to use Lucene payloads. However, I thought payloads could only be used to adjust the ranking of documents, and that they aren't used to select which documents are returned.
The latter is important since, for some use-cases, the number of documents that contain a pattern is really what I want.
Also, only the payloads on terms that match the query are examined. This means that payloads could at best help with the ranking of the first example query, Word=Washington,NER=Person, where we just want to make sure the term Washington is tagged as a Person. However, for the second example query, "NER=Person Word=arrived Word=at NER=Location", I need to check the tags on unspecified, and thus non-matching, terms.
Perhaps one way to achieve what you're asking is to index each class of annotation at the same position (i.e., Word, POS, Chunk, NER) and prefix each of the annotations with a unique string. Don't bother with prefixes for words. You will need a custom Analyzer to preserve the prefixes, but then you should be able to use the syntax you want for queries.
To be specific, what I am proposing is that you index the following tokens at the specified positions (shown here for the example sentence from the question, with words left unprefixed and each annotation class given its own prefix):
Position Tokens at that position
======== =====================================
1        The    POS=DT   Chunk=NP  NER=Person
2        man    POS=NN   Chunk=NP  NER=Person
3        went   POS=VBD  Chunk=VP
4        to     POS=TO   Chunk=PP
5        the    POS=DT   Chunk=NP  NER=Location
6        store  POS=NN   Chunk=NP  NER=Location
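For illustration, here is one way the stacked tokens could be produced: a hand-rolled TokenStream that gives annotation tokens a position increment of 0 so they land on the same position as the word they describe. This is an untested sketch against recent Lucene (5.x-8.x style) APIs; the class name AnnotationIndexing, the field name "text", and the exact prefix spelling are my own placeholders, not part of the original answer.

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class AnnotationIndexing {

    /** Emits every token listed for a word position; the first token advances the
     *  position by 1, the remaining annotation tokens are stacked at the same
     *  position (increment 0). */
    static final class AnnotationTokenStream extends TokenStream {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
        private final String[][] tokensPerPosition;
        private int pos = 0, idx = 0;

        AnnotationTokenStream(String[][] tokensPerPosition) {
            this.tokensPerPosition = tokensPerPosition;
        }

        @Override
        public boolean incrementToken() {
            if (pos >= tokensPerPosition.length) {
                return false;
            }
            clearAttributes();
            termAtt.setEmpty().append(tokensPerPosition[pos][idx]);
            posIncAtt.setPositionIncrement(idx == 0 ? 1 : 0); // 0 = same position as the word
            if (++idx == tokensPerPosition[pos].length) {
                pos++;
                idx = 0;
            }
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            pos = 0;
            idx = 0;
        }
    }

    /** Indexes the example sentence from the question as a single document. */
    static void indexExampleSentence(IndexWriter writer) throws IOException {
        String[][] tokens = {
            { "The",   "POS=DT",  "Chunk=NP", "NER=Person" },
            { "man",   "POS=NN",  "Chunk=NP", "NER=Person" },
            { "went",  "POS=VBD", "Chunk=VP" },
            { "to",    "POS=TO",  "Chunk=PP" },
            { "the",   "POS=DT",  "Chunk=NP", "NER=Location" },
            { "store", "POS=NN",  "Chunk=NP", "NER=Location" },
        };
        Document doc = new Document();
        // The field is indexed directly from the TokenStream, so no analyzer runs over it.
        doc.add(new Field("text", new AnnotationTokenStream(tokens), TextField.TYPE_NOT_STORED));
        writer.addDocument(doc);
    }
}

The same effect can also be achieved inside the custom Analyzer mentioned above; all that matters is that every annotation token ends up at the position of the word it annotates.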
To get the semantics, use a SpanQuery (e.g., a SpanNearQuery built from SpanTermQuerys) to preserve token sequence.
I haven't tried this, but indexing the different classes of terms at the same position should allow position-sensitive queries to do the right thing when evaluating expressions such as:
"NER=Person arrived at NER=Location"
Note the difference from your example: I deleted the Word= prefix to treat that as the default. Also, your choice of prefix syntax (e.g., "class=") may constrain the contents of the documents you are indexing: make sure the documents either don't contain such strings or that you escape them in some way in pre-processing. This is, of course, related to the analyzer you'll need to use.
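To sketch the query side under the same assumptions (field name "text" and the prefixes from the indexing sketch above; Lucene 8.x package names, with the span classes living in org.apache.lucene.queries.spans from 9.0 on), the second example query becomes an ordered SpanNearQuery, and IndexSearcher.count then reports how many documents contain the pattern, which is the counting use-case mentioned in the question:

import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.Directory;

public class AnnotationSearch {

    /** Counts documents matching: a word tagged Person, then "arrived", then "at",
     *  then a word tagged Location, in that order with no gaps in between. */
    static int countPersonArrivedAtLocation(Directory dir) throws IOException {
        SpanQuery pattern = new SpanNearQuery(new SpanQuery[] {
            new SpanTermQuery(new Term("text", "NER=Person")),
            new SpanTermQuery(new Term("text", "arrived")),
            new SpanTermQuery(new Term("text", "at")),
            new SpanTermQuery(new Term("text", "NER=Location"))
        }, 0, true); // slop 0, in order

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            return searcher.count(pattern); // number of documents containing the pattern
        }
    }
}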
Update: I used this technique for indexing sentence and paragraph boundaries in text (using break=sen and break=para tokens) so that I could decide where to break phrase query matches. Seems to work just fine.
What you are looking for are payloads. Lucid Imagination has a detailed blog entry on the subject. Payloads allow you to store a byte array of metadata about individual terms. Once you have indexed your data with the payloads included, you can create a new similarity mechanism that takes your payloads into account when scoring.
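For reference, a minimal sketch of the indexing side of the payload approach (my own illustration, not taken from the blog entry; it assumes the lucene-analyzers-common module and input text that has been pre-formatted so each token carries its tag after a '|' delimiter):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.IdentityEncoder;

public class PayloadAnalyzerSketch {

    /** For input prepared as "Washington|Person arrived|- at|- ...", the part after
     *  the '|' is stored as a byte-array payload on each indexed term. */
    static Analyzer payloadAnalyzer() {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                WhitespaceTokenizer source = new WhitespaceTokenizer();
                TokenStream sink = new DelimitedPayloadTokenFilter(source, '|', new IdentityEncoder());
                return new TokenStreamComponents(source, sink);
            }
        };
    }
    // Reading those payloads back for scoring requires a payload-aware query or Similarity;
    // the exact classes for that have changed across Lucene versions.
}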
You can indeed search for patterns of text in Lucene using SpanQuery: adjust the slop distance to limit how far apart the query terms can occur, and even constrain the order in which they appear.