Stanford CoreNLP - Understanding coreference resolution
I'm having some trouble understanding the changes made to the coref resolver in the latest version of the Stanford NLP tools.
As an example, below is a sentence and the corresponding CorefChainAnnotation:
The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
{1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]}
I am not sure I understand the meaning of these numbers. Looking at the source doesn't really help either.
Thank you
Comments (3)
I've been working with the coreference dependency graph, and I started out by using the other answer to this question. After a while, though, I realized that the algorithm above is not exactly correct: the output it produced was not even close to what my modified version gives.
For anyone else who uses this article, here is the algorithm I ended up with. It also filters out self-references, because every representativeMention also mentions itself, and a lot of mentions only reference themselves.
And the final output for your example sentence is the following:
Usually "the atom" ends up being the representative mention, but in this case, surprisingly, it doesn't. Another example, with slightly more accurate output, is the following sentence:
The Revolutionary War occurred during the 1700s and it was the first war in the United States.
which produces the following output:
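The listing and its output are not reproduced in this copy of the answer. As a rough illustration of the self-reference filtering idea only, here is a minimal sketch in plain Java. It is not the code from this answer and it does not use the CoreNLP API: `Mention`, `Chain` and `links` are hypothetical stand-ins, and the chains in `main` are written by hand for illustration, they are not annotator output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SelfReferenceFilter {
    /** Hypothetical stand-in for a coreference mention; not a CoreNLP class. */
    static final class Mention {
        final int sentence, start, end; // 1-based sentence and token span, end exclusive
        final String text;

        Mention(int sentence, int start, int end, String text) {
            this.sentence = sentence;
            this.start = start;
            this.end = end;
            this.text = text;
        }

        boolean sameSpan(Mention other) {
            return sentence == other.sentence && start == other.start && end == other.end;
        }
    }

    /** Hypothetical stand-in for a chain: one representative plus all of its mentions. */
    static final class Chain {
        final Mention representative;
        final List<Mention> mentions;

        Chain(Mention representative, List<Mention> mentions) {
            this.representative = representative;
            this.mentions = mentions;
        }
    }

    /** Returns "mention -> representative" links, skipping every self-reference. */
    static List<String> links(List<Chain> chains) {
        List<String> out = new ArrayList<>();
        for (Chain chain : chains) {
            for (Mention m : chain.mentions) {
                if (m.sameSpan(chain.representative)) {
                    continue; // the representative always appears in its own chain
                }
                out.add("\"" + m.text + "\" -> \"" + chain.representative.text + "\"");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hand-written illustrative data, not real annotator output.
        Mention war = new Mention(1, 1, 4, "The Revolutionary War");
        Mention it = new Mention(1, 9, 10, "it");
        Mention years = new Mention(1, 6, 8, "the 1700s");
        List<Chain> chains = Arrays.asList(
                new Chain(war, Arrays.asList(war, it)),
                new Chain(years, Arrays.asList(years))); // singleton chain: only references itself
        links(chains).forEach(System.out::println);
    }
}
```

With this hand-made input the singleton chain contributes nothing, and the two-mention chain contributes a single link from "it" to its representative. With real annotations the same test would be applied to whatever mention objects the library returns.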
The first number is a cluster id (representing tokens which stand for the same entity); see the source code of SieveCoreferenceSystem#coref(Document). The pairs of numbers are the output of CorefChain#toString(), where position is a set of position pairs of the entity mentions (to get them, use CorefChain.getCorefMentions()). Here is a complete code example (in Groovy), which shows how to get from positions to tokens:
Output (I do not understand where 's' comes from):
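The Groovy listing and its output are not reproduced in this copy. To make the structure described above concrete (cluster id, then a list of position pairs), here is a small sketch in plain Java that only parses the printed text from the question. It does not call CoreNLP; the class name `CorefChainText` and the assumption that each pair is printed as two space-separated integers are mine.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CorefChainText {
    // Matches one "id=[pairs]" entry, e.g. "1=[1 1, 1 2]".
    private static final Pattern ENTRY = Pattern.compile("(\\d+)=\\[([^\\]]*)\\]");

    /** Parses the printed map into cluster id -> list of {first, second} pairs. */
    public static Map<Integer, List<int[]>> parse(String printed) {
        Map<Integer, List<int[]>> clusters = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(printed);
        while (m.find()) {
            int clusterId = Integer.parseInt(m.group(1));
            List<int[]> pairs = new ArrayList<>();
            for (String pair : m.group(2).split(",")) {
                String[] parts = pair.trim().split("\\s+");
                if (parts.length != 2) {
                    continue; // skip empty or malformed pairs
                }
                pairs.add(new int[] {Integer.parseInt(parts[0]), Integer.parseInt(parts[1])});
            }
            clusters.put(clusterId, pairs);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // The printed annotation from the question.
        String printed = "{1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]}";
        for (Map.Entry<Integer, List<int[]>> e : parse(printed).entrySet()) {
            StringBuilder sb = new StringBuilder("cluster " + e.getKey() + ":");
            for (int[] p : e.getValue()) {
                sb.append(" (").append(p[0]).append(", ").append(p[1]).append(")");
            }
            System.out.println(sb);
        }
    }
}
```

Read this way, cluster 1 holds two position pairs, so two mentions were grouped into one entity, while clusters 5, 7 and 9 each hold a single pair, i.e. a mention that only refers to itself.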
These are the recent results from the annotator.
The markings are as follows:
The text belonging to the same cluster refers to the same context.