How to get original word-level entities instead of wordpiece tokens in BERT NER
I have a trained BERT model that I am willing to use to annotate some text.
I am using the transformers pipeline for the NER task in the following way:
from transformers import AutoModelForTokenClassification, BertTokenizer, pipeline

mode = AutoModelForTokenClassification.from_pretrained(<my_model_path>)
tokenize = BertTokenizer.from_pretrained(<my_model_path>)
nlp_ner = pipeline(
    "ner",
    model=mode,
    tokenizer=tokenize
)
Then, I am obtaining the prediction results by calling:
text = "3) Rewrite the last sentence “scanning probe ... perovskite family”. The current one is quite confusing."
result = nlp_ner(text)
Where the returned result is:
[{'entity': 'LABEL_1', 'score': 0.99999774, 'index': 1, 'word': '3', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999979, 'index': 2, 'word': ')', 'start': None, 'end': None}, {'entity': 'LABEL_8', 'score': 0.9999897, 'index': 3, 'word': 'rewrite', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999976, 'index': 4, 'word': 'the', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999962, 'index': 5, 'word': 'last', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999975, 'index': 6, 'word': 'sentence', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99998623, 'index': 7, 'word': '“', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99997735, 'index': 8, 'word': 'scanning', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9941041, 'index': 9, 'word': 'probe', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.999994, 'index': 10, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.99999696, 'index': 11, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999976, 'index': 12, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99998647, 'index': 13, 'word': 'per', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9999939, 'index': 14, 'word': '##ovsk', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.99999154, 'index': 15, 'word': '##ite', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9999942, 'index': 16, 'word': 'family', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.9997022, 'index': 17, 'word': '”', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999929, 'index': 18, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999977, 'index': 19, 'word': 'the', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.99999076, 'index': 20, 'word': 'current', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.99996257, 'index': 21, 'word': 'one', 'start': None, 'end': None}, {'entity': 'LABEL_8', 'score': 0.9169066, 'index': 22, 'word': 'is', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.6795164, 'index': 23, 'word': 'quite', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.7315716, 'index': 24, 'word': 'conf', 'start': None, 'end': None}, {'entity': 'LABEL_9', 'score': 0.9067044, 'index': 25, 'word': '##using', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999925, 'index': 26, 'word': '.', 'start': None, 'end': None}]
The problem I am facing now is that I would like to annotate the predicted classes back onto the text itself, but this looks complicated: the prediction results do not index words by splitting the text on spaces, and, for example, compound words are seen as multiple tokens, etc.
Is there a way to annotate the text back (for example, in Doccano JSON format) that is not too complex?
My goal is to be able to say: for all the "LABEL_9" predictions, highlight the corresponding spans of the initial text with a specific HTML class. Or, even simpler, find the start and end index of every word predicted as class "LABEL_9".
2 Answers
The tokenizer you are using inside the pipeline does not give out offset information, so you get None in the model response for start. Change it to a fast tokenizer and you should get the start location.
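A minimal sketch of that change, reusing the <my_model_path> placeholder from the question; BertTokenizerFast is backed by the Rust tokenizers library and tracks character offsets for every token:

from transformers import AutoModelForTokenClassification, BertTokenizerFast, pipeline

model = AutoModelForTokenClassification.from_pretrained(<my_model_path>)
# A fast tokenizer keeps track of character offsets into the input text.
tokenizer = BertTokenizerFast.from_pretrained(<my_model_path>)

nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer)
result = nlp_ner(text)
# Each entry now carries integer offsets into `text` instead of None, e.g.
# {'entity': 'LABEL_2', 'word': 'per', 'start': 49, 'end': 52, ...}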
WordPiece tokenizer
The BERT tokenizer is a WordPiece tokenizer, i.e. a single word might get tokenized into multiple tokens (for example, perovskite above). It is standard practice to use the entity of the first token when a word gets tokenized into multiple pieces; you can also use other strategies such as max voting. Below is the logic to get one entity per word.
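A sketch of that first-token aggregation, assuming result comes from the fast-tokenizer pipeline above (so start and end are populated); entities_per_word is a hypothetical helper name:

def entities_per_word(results):
    # Collapse subword predictions into word-level entities,
    # keeping the label of each word's first subword.
    words = []
    for r in results:
        if r["word"].startswith("##") and words:
            # Continuation subword: extend the current word, keep its label.
            words[-1]["word"] += r["word"][2:]
            words[-1]["end"] = r["end"]
        else:
            words.append({"word": r["word"], "entity": r["entity"],
                          "start": r["start"], "end": r["end"]})
    return words

word_entities = entities_per_word(result)
# 'per' + '##ovsk' + '##ite' collapse into a single entry:
# {'word': 'perovskite', 'entity': 'LABEL_2', 'start': 49, 'end': 59}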
The only subtle point to keep in mind here is that the transformers tokenizer doesn't split sentences on white space; it breaks the text into subwords (in most cases!). Two consecutive # characters (##) always mark a subword that continues the previous token, and this property can be exploited to reconstruct the original words. With this in mind, an unprincipled implementation for converting the output of the pipeline into Doccano format would look like the sketch below.
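A sketch of such a conversion, assuming Doccano's {"text": ..., "labels": [[start, end, label], ...]} sequence-labeling format; to_doccano is a hypothetical name, and the text is rebuilt by joining words with single spaces, so its spacing can differ from the original:

def to_doccano(ner_results):
    # Rebuild words from subword tokens and emit a Doccano-style record:
    # {"text": ..., "labels": [[start, end, label], ...]}.
    words, labels = [], []
    for r in ner_results:
        token, label = r["word"], r["entity"]
        if token.startswith("##") and words:
            # Continuation subword: glue it onto the previous word.
            words[-1] += token[2:]
            labels[-1][1] += len(token) - 2   # extend the span's end
        else:
            start = labels[-1][1] + 1 if labels else 0  # +1 for the joining space
            words.append(token)
            labels.append([start, start + len(token), label])
    return {"text": " ".join(words), "labels": labels}

doccano_record = to_doccano(result)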
UPDATE: adding a new argument, original_text, to the function helps with reconstructing the text and getting rid of the extra spaces.
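One way that update could look: anchor each reconstructed word in original_text instead of the space-joined string (again only a sketch; it naively assumes every token appears verbatim in the lowercased text, so [UNK] tokens would break it):

def to_doccano(ner_results, original_text):
    # Variant that anchors every span in original_text, so the offsets
    # refer to the real string and no extra spaces are introduced.
    lowered = original_text.lower()  # uncased models emit lowercased tokens
    labels = []
    cursor = 0
    current = None  # [start, end, label] of the word being rebuilt
    for r in ner_results:
        token = r["word"]
        if token.startswith("##") and current:
            current[1] += len(token) - 2  # extend the span over the subword
        else:
            if current:
                labels.append(current)
            start = lowered.find(token, cursor)  # naive: assumes no [UNK]s
            current = [start, start + len(token), r["entity"]]
            cursor = start + len(token)
    if current:
        labels.append(current)
    return {"text": original_text, "labels": labels}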
As you may have noticed, label assignment happens at the subword level. If the sequence labeler has learned the task well, it should assign a consistent label to all subwords of a word; otherwise it is up to you to aggregate the labels in whatever way best meets your goals.