Stanford CoreNLP - Understanding coreference resolution

Posted on 2024-11-18 15:29:01

I'm having some trouble understanding the changes made to the coref resolver in the latest version of the Stanford NLP tools.
As an example, below are a sentence and the corresponding CorefChainAnnotation:

The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.

{1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]}

I am not sure I understand the meaning of these numbers. Looking at the source doesn't really help either.

Thank you

迎风吟唱 2024-11-25 15:29:01

I've been working with the coreference dependency graph, and I started by using the other answer to this question. After a while, though, I realized that the algorithm in that answer is not exactly correct: the output it produced was not even close to that of the modified version I have now.

For anyone else who uses this post, here is the algorithm I ended up with. It also filters out self references, because every representative mention also mentions itself and a lot of mentions only reference themselves.

// CorefChainAnnotation maps each cluster id to its coreference chain
Map<Integer, CorefChain> coref = document.get(CorefChainAnnotation.class);

for(Map.Entry<Integer, CorefChain> entry : coref.entrySet()) {
    CorefChain c = entry.getValue();

    // skip singleton chains: they only contain a self reference, which isn't that useful
    if(c.getCorefMentions().size() <= 1)
        continue;

    CorefMention cm = c.getRepresentativeMention();
    String clust = "";
    // sentNum, startIndex and endIndex are 1-based, and endIndex is exclusive
    List<CoreLabel> tks = document.get(SentencesAnnotation.class).get(cm.sentNum-1).get(TokensAnnotation.class);
    for(int i = cm.startIndex-1; i < cm.endIndex-1; i++)
        clust += tks.get(i).get(TextAnnotation.class) + " ";
    clust = clust.trim();
    System.out.println("representative mention: \"" + clust + "\" is mentioned by:");

    for(CorefMention m : c.getCorefMentions()){
        // rebuild the text of this mention from its tokens
        String clust2 = "";
        tks = document.get(SentencesAnnotation.class).get(m.sentNum-1).get(TokensAnnotation.class);
        for(int i = m.startIndex-1; i < m.endIndex-1; i++)
            clust2 += tks.get(i).get(TextAnnotation.class) + " ";
        clust2 = clust2.trim();
        // don't need the self mention (compared by text, so any mention with the same text is skipped too)
        if(clust.equals(clust2))
            continue;

        System.out.println("\t" + clust2);
    }
}

The final output for your example sentence is the following:

representative mention: "a basic unit of matter" is mentioned by:
	The atom
	it

Usually "The atom" would end up being the representative mention, but surprisingly in this case it doesn't. Another example, with slightly more accurate output, is the following sentence:

The Revolutionary War occurred during the 1700s and it was the first war in the United States.

It produces the following output:

representative mention: "The Revolutionary War" is mentioned by:
	it
	the first war in the United States
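
For readers who want to try the filtering logic without setting up a pipeline, here is a minimal self-contained sketch. It does not use CoreNLP at all: the Mention class, the hand-written token list and the mention spans are hypothetical stand-ins chosen to mirror the Revolutionary War example. It also skips the representative mention by object identity rather than by comparing text, which is one possible way to avoid dropping a different mention that happens to have the same wording.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChainFilterDemo {
    /** Hypothetical stand-in for a mention; not the CoreNLP CorefMention class. */
    static final class Mention {
        final int sentNum;    // 1-based sentence number
        final int startIndex; // 1-based index of the first token
        final int endIndex;   // 1-based index one past the last token
        Mention(int sentNum, int startIndex, int endIndex) {
            this.sentNum = sentNum;
            this.startIndex = startIndex;
            this.endIndex = endIndex;
        }
    }

    /** Joins the tokens covered by a mention into a single string. */
    static String textOf(List<List<String>> sentences, Mention m) {
        List<String> tokens = sentences.get(m.sentNum - 1);
        return String.join(" ", tokens.subList(m.startIndex - 1, m.endIndex - 1));
    }

    /** Returns the text of every mention in the chain except the representative one. */
    static List<String> otherMentions(List<List<String>> sentences,
                                      Mention representative, List<Mention> chain) {
        List<String> result = new ArrayList<>();
        if (chain.size() <= 1) {
            return result; // singleton chain: only a self reference
        }
        for (Mention m : chain) {
            if (m == representative) {
                continue; // skip by identity, so other mentions with equal text are kept
            }
            result.add(textOf(sentences, m));
        }
        return result;
    }

    /** Runs the filter on hand-written data for the Revolutionary War sentence. */
    static List<String> demo() {
        List<List<String>> sentences = Arrays.asList(Arrays.asList(
            "The", "Revolutionary", "War", "occurred", "during", "the", "1700s", "and",
            "it", "was", "the", "first", "war", "in", "the", "United", "States", "."));
        Mention rep = new Mention(1, 1, 4); // "The Revolutionary War"
        List<Mention> chain = Arrays.asList(
            rep, new Mention(1, 9, 10), new Mention(1, 11, 18));
        return otherMentions(sentences, rep, chain);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With the spans assumed above, demo() returns the two non-representative mentions, "it" and "the first war in the United States".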

云朵有点甜 2024-11-25 15:29:01

The first number is a cluster id (a cluster groups the mentions that stand for the same entity); see the source code of SieveCoreferenceSystem#coref(Document). The number pairs are the output of CorefChain#toString():

public String toString(){
    return position.toString();
}

where position is a set of position pairs of entity mentions (to get them, use CorefChain.getCorefMentions()). Here is a complete code example (in Groovy) that shows how to get from positions to tokens:

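// Imports assume the package layout of the dcoref-era CoreNLP releases
import edu.stanford.nlp.dcoref.CorefChain
import edu.stanford.nlp.dcoref.CorefChain.CorefMention
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP

class Example {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        props.put("dcoref.score", true);
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // aText was never declared in the original snippet; it is assumed to hold the input text
        String aText = "The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.";
        Annotation document = new Annotation(aText);

        pipeline.annotate(document);
        Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);

        println aText

        for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
            CorefChain c = entry.getValue();
            println "ClusterId: " + entry.getKey();
            CorefMention cm = c.getRepresentativeMention();
            // note: subSequence takes character offsets, while startIndex/endIndex are token indices
            println "Representative Mention: " + aText.subSequence(cm.startIndex, cm.endIndex);

            List<CorefMention> cms = c.getCorefMentions();
            print "Mentions: ";
            cms.each { it ->
                print aText.subSequence(it.startIndex, it.endIndex) + "|";
            }
            println();
        }
    }
}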

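Output (the stray 's' and the other truncated fragments most likely appear because startIndex and endIndex are 1-based token indices, while subSequence treats them as character offsets):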

The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
ClusterId: 1
Representative Mention: he
Mentions: he|atom |s|
ClusterId: 6
Representative Mention:  basic unit 
Mentions:  basic unit |
ClusterId: 8
Representative Mention:  unit 
Mentions:  unit |
ClusterId: 10
Representative Mention: it 
Mentions: it |
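
To illustrate the likely cause of the fragments in this output, here is a small self-contained sketch that does not depend on CoreNLP. It applies the same spans once as character offsets and once as 1-based token indices with an exclusive end. The token list is tokenized by hand, and the three spans (1, 3), (4, 9) and (10, 11) are an assumption inferred from the cluster 1 output above, not values read from the library.

```java
import java.util.Arrays;
import java.util.List;

public class TokenSpanDemo {
    // Prefix of the example sentence; enough to cover the three mentions of cluster 1.
    static final String TEXT = "The atom is a basic unit of matter, it consists";

    // Hand-tokenized version of TEXT; the comma is a token of its own.
    static final List<String> TOKENS = Arrays.asList(
        "The", "atom", "is", "a", "basic", "unit", "of", "matter", ",", "it", "consists");

    /** Treats start and end as 0-based character offsets, as subSequence does. */
    static String byCharOffsets(int start, int end) {
        return TEXT.subSequence(start, end).toString();
    }

    /** Treats start and end as 1-based token indices with an exclusive end. */
    static String byTokenIndices(int start, int end) {
        return String.join(" ", TOKENS.subList(start - 1, end - 1));
    }

    public static void main(String[] args) {
        // Assumed (startIndex, endIndex) pairs for the mentions in cluster 1.
        int[][] spans = { {1, 3}, {4, 9}, {10, 11} };
        for (int[] s : spans) {
            System.out.println("as chars: \"" + byCharOffsets(s[0], s[1])
                + "\"  as tokens: \"" + byTokenIndices(s[0], s[1]) + "\"");
        }
    }
}
```

Read as character offsets, the spans give "he", "atom " and "s", which matches the Mentions line printed for cluster 1. Read as token indices, they give "The atom", "a basic unit of matter" and "it".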

国粹 2024-11-25 15:29:01

These are the recent results from the annotator.

[1, 1] 1 The atom
[1, 2] 1 a basic unit of matter
[1, 3] 1 it
[1, 6] 6 negatively charged electrons
[1, 5] 5 a cloud of negatively charged electrons

The markings are as follows:

[sentence number, mention id]  cluster id  associated text

Mentions belonging to the same cluster refer to the same entity.
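
As a rough illustration of how such rows group into clusters, here is a self-contained sketch that does not use CoreNLP. The Row class is a hypothetical container, and the data is copied by hand from the listing above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ClusterGroupingDemo {
    /** Hypothetical container for one row of the listing; not a CoreNLP class. */
    static final class Row {
        final int sentence;  // sentence number
        final int mentionId; // mention id within the document
        final int cluster;   // cluster id
        final String text;   // text of the mention
        Row(int sentence, int mentionId, int cluster, String text) {
            this.sentence = sentence;
            this.mentionId = mentionId;
            this.cluster = cluster;
            this.text = text;
        }
    }

    /** Groups mention texts by cluster id, keeping the row order inside each cluster. */
    static Map<Integer, List<String>> groupByCluster(List<Row> rows) {
        Map<Integer, List<String>> clusters = new TreeMap<>();
        for (Row r : rows) {
            clusters.computeIfAbsent(r.cluster, k -> new ArrayList<>()).add(r.text);
        }
        return clusters;
    }

    /** The five rows from the listing, copied by hand. */
    static List<Row> exampleRows() {
        return Arrays.asList(
            new Row(1, 1, 1, "The atom"),
            new Row(1, 2, 1, "a basic unit of matter"),
            new Row(1, 3, 1, "it"),
            new Row(1, 6, 6, "negatively charged electrons"),
            new Row(1, 5, 5, "a cloud of negatively charged electrons"));
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, List<String>> e : groupByCluster(exampleRows()).entrySet()) {
            System.out.println("cluster " + e.getKey() + ": " + e.getValue());
        }
    }
}
```

Cluster 1 ends up with three mentions ("The atom", "a basic unit of matter", "it"), while clusters 5 and 6 each contain a single mention, which is the kind of singleton chain the first answer filters out.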
