Optimal way to tag words in substrings with labels when start & end indices are provided [Python]

Posted 2025-01-24 13:30:01

I'm trying to format data in the CoNLL format for an NER task (this info is largely irrelevant). What I want to accomplish, as efficiently as possible, is this:

Input:

Text: The quick brown fox jumps over the lazy dog.
Indices: 10 - 18 (brown fox), 35 - 42 (lazy dog)

Desired Output:

The        O
quick      O
brown      X
fox        X
jumps      O
over       O
the        O
lazy       Y
dog        Y
.          O

Is there a single-pass way to do this (because I have a lot of examples -- over 100k)?

萌梦深 2025-01-31 13:30:01

EDIT: Revised input

Input:

s = "The quick brown fox jumps over the lazy dog."
s = s.replace(".","")
tag_indices = { "X" : [[10,18], [20, 25]], "Y" : [[35, 42]]}
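
As a quick sanity check on this input, assuming the indices are character offsets that include both endpoints (which matches the spans in the question), slicing with `end + 1` recovers each phrase. Note that the revised input also contains `[20, 25]`, which covers `jumps`, so that word is tagged `X` in the output further down.

```python
s = "The quick brown fox jumps over the lazy dog."
s = s.replace(".", "")
tag_indices = {"X": [[10, 18], [20, 25]], "Y": [[35, 42]]}

# Slice each inclusive [start, end] span; Python slices exclude the stop index.
spans = {tag: [s[start:end + 1] for start, end in indices]
         for tag, indices in tag_indices.items()}
print(spans)
# {'X': ['brown fox', 'jumps '], 'Y': ['lazy dog']}
```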

Add a normal tag for the other words

normal_tag = "O"

Now split the sentence into words and first tag every word with normal_tag

words = s.split(" ")
# Start with every word mapped to the default tag
tagged_words = {w: normal_tag for w in words}

Now go through each tag and its list of indices in tag_indices. For every (start, end) pair, slice the input string to get a substring; end is inclusive, so the slice stops at end+1.

For each word in this substring, tag it appropriately.

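for tag, indices in tag_indices.items():
    for start, end in indices:
        # end is inclusive, so slice up to end + 1
        substring = s[start:end + 1]
        for word in substring.split(" "):
            # skip empty strings left by a leading or trailing space
            if len(word) > 0:
                tagged_words[word] = tag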
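
One caveat with keying the dictionary on the word itself is that a word occurring twice in a sentence can only hold one tag. A minimal sketch of an alternative, assuming the same inclusive character spans, keeps each token together with its character offsets and tags it when it overlaps a span. The helper name `tag_tokens` and the use of `re.finditer` with `\S+` for whitespace tokenization are illustrative choices, not part of the original answer.

```python
import re

def tag_tokens(text, tag_indices, normal_tag="O"):
    """Return (word, tag) pairs, one per whitespace-separated token."""
    # Flatten {"X": [[10, 18]], ...} into (start, end, tag) triples.
    spans = [(start, end, tag)
             for tag, indices in tag_indices.items()
             for start, end in indices]
    tagged = []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.start(), match.end() - 1  # inclusive end
        tag = normal_tag
        for start, end, span_tag in spans:
            # Two inclusive ranges overlap when each starts before the other ends.
            if tok_start <= end and start <= tok_end:
                tag = span_tag
                break
        tagged.append((match.group(), tag))
    return tagged

# "dog" appears twice, but only the occurrence inside [4, 11] is tagged.
pairs = tag_tokens("the lazy dog saw the other dog", {"Y": [[4, 11]]})
print(pairs)
```

Because the result is a list of pairs rather than a dictionary, the word order of the sentence is preserved and repeated words are tagged independently.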

The final output is in the tagged_words dictionary

for k,v in tagged_words.items():
    print(k,v)

Output:

The O
quick O
brown X
fox X
jumps X
over O
the O
lazy Y
dog Y
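
For the single-pass requirement in the question, one possible sketch is to sort the spans once and then walk the tokens and spans together, so each sentence is scanned left to right a single time. This assumes the spans do not overlap each other, uses a hypothetical helper `to_conll`, and splits punctuation into its own token with the regex `\w+|[^\w\s]` so the final `.` is kept as in the desired output. None of these choices come from the answer above.

```python
import re

TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # words, or single punctuation marks

def to_conll(text, tag_indices, normal_tag="O"):
    """Return CoNLL-style lines "token<TAB>tag" for one sentence."""
    # Sort (start, end, tag) triples by start; spans are assumed not to overlap.
    spans = sorted((start, end, tag)
                   for tag, indices in tag_indices.items()
                   for start, end in indices)
    lines, i = [], 0
    for match in TOKEN_RE.finditer(text):
        tok_start, tok_end = match.start(), match.end() - 1  # inclusive end
        # Advance past spans that end before this token begins.
        while i < len(spans) and spans[i][1] < tok_start:
            i += 1
        # The token is tagged if the current span starts at or before its end.
        inside = i < len(spans) and spans[i][0] <= tok_end
        lines.append(f"{match.group()}\t{spans[i][2] if inside else normal_tag}")
    return lines

text = "The quick brown fox jumps over the lazy dog."
conll_lines = to_conll(text, {"X": [[10, 18]], "Y": [[35, 42]]})
print("\n".join(conll_lines))
```

With the question's original indices this tags `brown` and `fox` as `X`, `lazy` and `dog` as `Y`, and everything else, including the final `.`, as `O`. For a large corpus, the function can be called once per sentence, writing a blank line between sentences as CoNLL files conventionally do.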
