Stanford CoreNLP - Understanding coreference resolution

Posted on 2024-11-18 15:29:01

I'm having some trouble understanding the changes made to the coref resolver in the latest version of the Stanford NLP tools. As an example, below is a sentence and the corresponding CorefChainAnnotation:

The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.

{1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]}

I am not sure I understand the meaning of these numbers. Looking at the source doesn't really help either.

Thank you

Comments (3)

迎风吟唱 2024-11-25 15:29:01

I've been working with the coreference dependency graph, and I started from the other answer to this question. After a while, though, I realized that the algorithm in that answer is not exactly correct: its output is not even close to what my modified version produces.

For anyone else who finds this post, here is the algorithm I ended up with. It also filters out self-references, because every representative mention is also listed among the mentions of its own chain, and many chains contain only a single mention.

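// Imports assume the dcoref-era package layout; package names differ in other CoreNLP versions.
import java.util.List;
import java.util.Map;
import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefChain.CorefMention;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;

// 'document' is an Annotation that has already been run through the pipeline.
Map<Integer, CorefChain> coref = document.get(CorefChainAnnotation.class);

for(Map.Entry<Integer, CorefChain> entry : coref.entrySet()) {
    CorefChain c = entry.getValue();

    // Skip chains with a single mention: they are only self-references.
    if(c.getCorefMentions().size() <= 1)
        continue;

    // Rebuild the representative mention's text from its tokens.
    // sentNum and startIndex are 1-based; endIndex is exclusive.
    CorefMention cm = c.getRepresentativeMention();
    String clust = "";
    List<CoreLabel> tks = document.get(SentencesAnnotation.class).get(cm.sentNum-1).get(TokensAnnotation.class);
    for(int i = cm.startIndex-1; i < cm.endIndex-1; i++)
        clust += tks.get(i).get(TextAnnotation.class) + " ";
    clust = clust.trim();
    System.out.println("representative mention: \"" + clust + "\" is mentioned by:");

    for(CorefMention m : c.getCorefMentions()){
        // Same token lookup for every other mention in the chain.
        String clust2 = "";
        tks = document.get(SentencesAnnotation.class).get(m.sentNum-1).get(TokensAnnotation.class);
        for(int i = m.startIndex-1; i < m.endIndex-1; i++)
            clust2 += tks.get(i).get(TextAnnotation.class) + " ";
        clust2 = clust2.trim();
        // Skip the self mention. This compares text, so any other mention
        // with identical text is skipped as well.
        if(clust.equals(clust2))
            continue;

        System.out.println("\t" + clust2);
    }
}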

The final output for your example sentence is:

representative mention: "a basic unit of matter" is mentioned by:
	The atom
	it

Usually "The atom" would end up as the representative mention, but surprisingly, in this case it does not. An example with slightly more accurate output is the following sentence:

The Revolutionary War occurred during the 1700s and it was the first war in the United States.

This produces the following output:

representative mention: "The Revolutionary War" is mentioned by:
	it
	the first war in the United States
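
One caveat with the loop above: it drops the self-reference by comparing mention text, so a separate mention whose text happens to equal the representative's is dropped too. Below is a minimal sketch of comparing positions instead. The Span class, its fields and both helper methods are hypothetical stand-ins for a mention's sentence number and token indices, not CoreNLP types, and the sample chain is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SelfMentionFilter {
    // Hypothetical stand-in for a coreference mention: its position and its text.
    static final class Span {
        final int sentNum, startIndex, endIndex;
        final String text;

        Span(int sentNum, int startIndex, int endIndex, String text) {
            this.sentNum = sentNum;
            this.startIndex = startIndex;
            this.endIndex = endIndex;
            this.text = text;
        }

        boolean samePosition(Span o) {
            return sentNum == o.sentNum && startIndex == o.startIndex && endIndex == o.endIndex;
        }
    }

    // Text comparison, as in the loop above: drops every mention with the same text.
    static List<String> othersByText(Span rep, List<Span> mentions) {
        List<String> out = new ArrayList<>();
        for (Span m : mentions)
            if (!m.text.equals(rep.text)) out.add(m.text);
        return out;
    }

    // Position comparison: drops only the representative mention itself.
    static List<String> othersByPosition(Span rep, List<Span> mentions) {
        List<String> out = new ArrayList<>();
        for (Span m : mentions)
            if (!m.samePosition(rep)) out.add(m.text);
        return out;
    }

    public static void main(String[] args) {
        Span rep = new Span(1, 1, 3, "The war");
        List<Span> chain = Arrays.asList(rep, new Span(1, 7, 8, "it"), new Span(2, 1, 3, "The war"));
        System.out.println(othersByText(rep, chain));     // [it]
        System.out.println(othersByPosition(rep, chain)); // [it, The war]
    }
}
```

Whether the repeated mention is worth printing depends on the use case; the position check only makes the choice explicit.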

云朵有点甜 2024-11-25 15:29:01

The first number is a cluster id (a cluster groups the tokens that stand for the same entity); see the source code of SieveCoreferenceSystem#coref(Document). The number pairs are the output of CorefChain#toString():

public String toString(){
    return position.toString();
}

where position is a set of position pairs for the entity's mentions (to get them, use CorefChain.getCorefMentions()). Here is a complete code example (in Groovy) that shows how to get from positions to tokens:

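// Imports assume the dcoref-era package layout; adjust for your CoreNLP version.
import edu.stanford.nlp.dcoref.CorefChain
import edu.stanford.nlp.dcoref.CorefChain.CorefMention
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP

class Example {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        props.put("dcoref.score", true);
        // pipeline and aText were used without being declared in the original snippet.
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String aText = "The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.";
        Annotation document = new Annotation(aText);

        pipeline.annotate(document);
        Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);

        println aText

        for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
            CorefChain c = entry.getValue();
            println "ClusterId: " + entry.getKey();
            CorefMention cm = c.getRepresentativeMention();
            // Caution: startIndex/endIndex appear to be token indices (see the other answer),
            // so slicing the raw text with them prints fragments rather than whole mentions.
            println "Representative Mention: " + aText.subSequence(cm.startIndex, cm.endIndex);

            List<CorefMention> cms = c.getCorefMentions();
            print "Mentions: ";
            cms.each { it ->
                print aText.subSequence(it.startIndex, it.endIndex) + "|";
            }
            println();
        }
    }
}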

Output (I do not understand where 's' comes from):

The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
ClusterId: 1
Representative Mention: he
Mentions: he|atom |s|
ClusterId: 6
Representative Mention:  basic unit 
Mentions:  basic unit |
ClusterId: 8
Representative Mention:  unit 
Mentions:  unit |
ClusterId: 10
Representative Mention: it 
Mentions: it |
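
A likely explanation for the stray fragments, offered as a sketch rather than a statement about the library's internals: the other answer in this thread treats startIndex and endIndex as 1-based token positions within the sentence (endIndex exclusive), not as character offsets. Reading the same numbers as character offsets into the raw text reproduces "he", "atom " and "s" exactly. The class name MentionSpans, the hand-written token list and the three spans below are illustrative assumptions chosen to match the output above.

```java
import java.util.Arrays;
import java.util.List;

final class MentionSpans {
    // Token reading: startIndex is 1-based, endIndex is exclusive.
    static String byTokenIndex(List<String> tokens, int startIndex, int endIndex) {
        return String.join(" ", tokens.subList(startIndex - 1, endIndex - 1));
    }

    // Character reading: what slicing the raw text with the same numbers does.
    static String byCharOffset(String text, int startIndex, int endIndex) {
        return text.subSequence(startIndex, endIndex).toString();
    }

    public static void main(String[] args) {
        String text = "The atom is a basic unit of matter, it consists of";
        // Hand-written tokenization (assumption): the comma is its own token.
        List<String> tokens = Arrays.asList(
            "The", "atom", "is", "a", "basic", "unit", "of", "matter", ",", "it", "consists", "of");

        // Assumed (startIndex, endIndex) pairs for the three mentions of cluster 1.
        int[][] spans = {{1, 3}, {4, 9}, {10, 11}};
        for (int[] s : spans) {
            System.out.println("[" + byCharOffset(text, s[0], s[1]) + "] vs ["
                + byTokenIndex(tokens, s[0], s[1]) + "]");
        }
        // [he] vs [The atom]
        // [atom ] vs [a basic unit of matter]
        // [s] vs [it]
    }
}
```

Under that reading, the mention text should be rebuilt from the tokens of the mention's sentence (sentNum) instead of being sliced out of the raw string.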

国粹 2024-11-25 15:29:01

These are recent results from the annotator:

  1. [1, 1] 1 The atom
  2. [1, 2] 1 a basic unit of matter
  3. [1, 3] 1 it
  4. [1, 6] 6 negatively charged electrons
  5. [1, 5] 5 a cloud of negatively charged electrons

The markings are as follows:

[Sentence number,'id']  Cluster_no  Text_Associated

Mentions that belong to the same cluster refer to the same entity.
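
The rows above can be regrouped by cluster number to make the chains explicit. This is a minimal sketch using only the five rows listed in this answer; the Row class and the groupByCluster helper are hypothetical names, and the data is typed in by hand rather than read from the annotator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ClusterGrouping {
    // One row of the listing: [sentence number, id], cluster number, text.
    static final class Row {
        final int sentNum, id, cluster;
        final String text;

        Row(int sentNum, int id, int cluster, String text) {
            this.sentNum = sentNum;
            this.id = id;
            this.cluster = cluster;
            this.text = text;
        }
    }

    // Collects mention texts per cluster, with clusters sorted by number.
    static Map<Integer, List<String>> groupByCluster(List<Row> rows) {
        Map<Integer, List<String>> chains = new TreeMap<>();
        for (Row r : rows)
            chains.computeIfAbsent(r.cluster, k -> new ArrayList<>()).add(r.text);
        return chains;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row(1, 1, 1, "The atom"),
            new Row(1, 2, 1, "a basic unit of matter"),
            new Row(1, 3, 1, "it"),
            new Row(1, 6, 6, "negatively charged electrons"),
            new Row(1, 5, 5, "a cloud of negatively charged electrons"));
        System.out.println(groupByCluster(rows));
        // {1=[The atom, a basic unit of matter, it], 5=[a cloud of negatively charged electrons], 6=[negatively charged electrons]}
    }
}
```

Clusters 5 and 6 each hold a single mention, which is the self-reference case the first answer filters out.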
