How to get original word-level entities instead of wordpiece tokens in BERT NER
I have a trained BERT model that I am willing to use to annotate some text.
I am using the transformers pipeline for the NER task in the following way:
from transformers import AutoModelForTokenClassification, BertTokenizer, pipeline

mode = AutoModelForTokenClassification.from_pretrained(<my_model_path>)
tokenize = BertTokenizer.from_pretrained(<my_model_path>)
nlp_ner = pipeline(
    "ner",
    model=mode,
    tokenizer=tokenize
)
Then, I am obtaining the prediction results by calling:
text = "3) Rewrite the last sentence “scanning probe ... perovskite family”. The current one is quite confusing."
result = nlp_ner(text)
Where the returned result is:
[{'entity': 'LABEL_1', 'score': 0.99999774, 'index': 1, 'word': '3', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999979, 'index': 2, 'word': ')', 'start': None, 'end': None}, {'entity': 'LABEL_8', 'score': 0.9999897, 'index': 3, 'word': 'rewrite', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999976, 'index': 4, 'word': 'the', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999962, 'index': 5, 'word': 'last', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999975, 'index': 6, 'word': 'sentence', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99998623, 'index': 7, 'word': '“', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99997735, 'index': 8, 'word': 'scanning', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9941041, 'index': 9, 'word': 'probe', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.999994, 'index': 10, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.99999696, 'index': 11, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999976, 'index': 12, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.99998647, 'index': 13, 'word': 'per', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9999939, 'index': 14, 'word': '##ovsk', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.99999154, 'index': 15, 'word': '##ite', 'start': None, 'end': None}, {'entity': 'LABEL_3', 'score': 0.9999942, 'index': 16, 'word': 'family', 'start': None, 'end': None}, {'entity': 'LABEL_2', 'score': 0.9997022, 'index': 17, 'word': '”', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999929, 'index': 18, 'word': '.', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999977, 'index': 19, 'word': 'the', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.99999076, 'index': 20, 'word': 'current', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.99996257, 'index': 21, 'word': 'one', 'start': None, 'end': None}, {'entity': 'LABEL_8', 'score': 0.9169066, 'index': 22, 'word': 'is', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.6795164, 'index': 23, 'word': 'quite', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.7315716, 'index': 24, 'word': 'conf', 'start': None, 'end': None}, {'entity': 'LABEL_9', 'score': 0.9067044, 'index': 25, 'word': '##using', 'start': None, 'end': None}, {'entity': 'LABEL_1', 'score': 0.9999925, 'index': 26, 'word': '.', 'start': None, 'end': None}]
The problem I am facing now is that I would like to annotate the predicted classes back onto the text itself, but this looks complicated: the prediction results do not index words by splitting the text on spaces, and, for example, compound words are seen as multiple tokens, etc.
Is there a way to annotate the text back (for example, in Doccano JSON format) that is not too complex?
My goal is to be able to say: for all the "LABEL_9" predictions, highlight the corresponding spans of the initial text with a specific HTML class. Or, even simpler, find the start and end index of every word predicted as class "LABEL_9".
2 Answers
The tokenizer you are using inside the pipeline does not give out offset information, so you get None in the model response for start. Change it to a fast tokenizer and you should get the start location.
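A minimal sketch of that change, reusing the <my_model_path> placeholder from the question; BertTokenizerFast is backed by the Rust tokenizers library and tracks character offsets for every token:

from transformers import AutoModelForTokenClassification, BertTokenizerFast, pipeline

model = AutoModelForTokenClassification.from_pretrained(<my_model_path>)
# A fast tokenizer keeps track of character offsets into the input text.
tokenizer = BertTokenizerFast.from_pretrained(<my_model_path>)

nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer)
result = nlp_ner(text)
# Each entry now carries integer offsets into `text` instead of None, e.g.
# {'entity': 'LABEL_2', 'word': 'per', 'start': 49, 'end': 52, ...}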
WordPiece tokenizer
The BERT tokenizer is a WordPiece tokenizer, i.e. a single word might get tokenized into multiple tokens (for example, perovskite above). It is standard practice to use the entity of the first token when a word gets tokenized into multiple pieces; you can also use other strategies such as max voting. Below is the logic to get one entity per word.
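A sketch of that first-token aggregation, assuming result comes from the fast-tokenizer pipeline above (so start and end are populated); entities_per_word is a hypothetical helper name:

def entities_per_word(results):
    # Collapse subword predictions into word-level entities,
    # keeping the label of each word's first subword.
    words = []
    for r in results:
        if r["word"].startswith("##") and words:
            # Continuation subword: extend the current word, keep its label.
            words[-1]["word"] += r["word"][2:]
            words[-1]["end"] = r["end"]
        else:
            words.append({"word": r["word"], "entity": r["entity"],
                          "start": r["start"], "end": r["end"]})
    return words

word_entities = entities_per_word(result)
# 'per' + '##ovsk' + '##ite' collapse into a single entry:
# {'word': 'perovskite', 'entity': 'LABEL_2', 'start': 49, 'end': 59}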
The only subtle point to keep in mind here is that the transformers tokenizer doesn't split sentences on white space; it breaks the text into subwords (in most cases!). Two consecutive # characters (##) always mark a subword that continues the previous token, and this property can be exploited to reconstruct the original words. With this in mind, an unprincipled implementation for converting the output of the pipeline into Doccano format would look like the sketch below.
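A sketch of such a conversion, assuming Doccano's {"text": ..., "labels": [[start, end, label], ...]} sequence-labeling format; to_doccano is a hypothetical name, and the text is rebuilt by joining words with single spaces, so its spacing can differ from the original:

def to_doccano(ner_results):
    # Rebuild words from subword tokens and emit a Doccano-style record:
    # {"text": ..., "labels": [[start, end, label], ...]}.
    words, labels = [], []
    for r in ner_results:
        token, label = r["word"], r["entity"]
        if token.startswith("##") and words:
            # Continuation subword: glue it onto the previous word.
            words[-1] += token[2:]
            labels[-1][1] += len(token) - 2   # extend the span's end
        else:
            start = labels[-1][1] + 1 if labels else 0  # +1 for the joining space
            words.append(token)
            labels.append([start, start + len(token), label])
    return {"text": " ".join(words), "labels": labels}

doccano_record = to_doccano(result)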
UPDATE: adding a new argument, original_text, to the function helps with reconstructing the text and getting rid of the extra spaces.
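One way that update could look: anchor each reconstructed word in original_text instead of the space-joined string (again only a sketch; it naively assumes every token appears verbatim in the lowercased text, so [UNK] tokens would break it):

def to_doccano(ner_results, original_text):
    # Variant that anchors every span in original_text, so the offsets
    # refer to the real string and no extra spaces are introduced.
    lowered = original_text.lower()  # uncased models emit lowercased tokens
    labels = []
    cursor = 0
    current = None  # [start, end, label] of the word being rebuilt
    for r in ner_results:
        token = r["word"]
        if token.startswith("##") and current:
            current[1] += len(token) - 2  # extend the span over the subword
        else:
            if current:
                labels.append(current)
            start = lowered.find(token, cursor)  # naive: assumes no [UNK]s
            current = [start, start + len(token), r["entity"]]
            cursor = start + len(token)
    if current:
        labels.append(current)
    return {"text": original_text, "labels": labels}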
As you may have noticed, label assignment happens at the subword level. If the sequence labeler has learned the task well, it should assign a consistent label to all subwords of a word; otherwise it is up to you to aggregate the labels in whatever way best meets your goals.