How to label multi-word entities?

I'm quite new to data analysis (and Python in general), and I'm currently a bit stuck in my project.

For my NLP task I need to create training data, i.e. find specific entities in sentences and label them. I have multiple CSV files containing the entities I am trying to find, many of them consisting of multiple words. I have tokenized and lemmatized the unlabeled sentences with spaCy and loaded them into a pandas.DataFrame.

My main problem is: how do I now compare the tokenized sentences with the entity lists and label the (often multi-word) entities? With around 0.5 GB of sentences, I don't think it is feasible to just for-loop over every sentence, then for-loop over every entity in every class list and do a simple substring search. Is there any smart way to use pandas.Series or DataFrame to do this labeling?
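For concreteness, the nested-loop approach in question would look roughly like the sketch below (sentences, finance_entities and sport_entities are placeholder names); every sentence gets scanned once per entity, which is what makes it infeasible at this scale:

matches = []
for sentence in sentences:                              # ~0.5 GB of text
    for label, entities in [('FINANCE', finance_entities),
                            ('SPORT', sport_entities)]:
        for entity in entities:                         # one scan per entity
            if entity in sentence:                      # plain substring search
                matches.append((sentence, entity, label))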

As mentioned, I don't really have any experience with pandas/numpy etc., and after a lot of web searching I still haven't found an answer to my problem.

Say that this is a sample of finance.csv, one of my entity lists:

"Frontwave Credit Union",
"St. Mary's Bank",
"Center for Financial Services Innovation",
...

And that this is a sample of sport.csv, another one of my entity lists:

"Christiano Ronaldo",
"Lewis Hamilton",
...

And an example (dumb) sentence:

"Dear members of Frontwave Credit Union, any credit demanded by Lewis Hamilton is invalid, said Ronaldo"

The result I'd like would be something like a table of tokens with the matching entity labels (with IOB labeling):

"Dear "- O
"members" - O
"of" - O
"Frontwave" - B-FINANCE
"Credit" - I-FINANCE
"Union" - I-FINANCE
"," - O
"any" - O
...
"Lewis" - B-SPORT
"Hamilton" - I-SPORT
...
"said" - O
"Ronaldo" - O

情丝乱 2025-01-20 16:04:00


Use:

FINANCE = ["Frontwave Credit Union",
"St. Mary's Bank",
"Center for Financial Services Innovation"]

SPORT = [
    "Christiano Ronaldo",
    "Lewis Hamilton",
]

FINANCE = '|'.join(FINANCE)
sent = pd.DataFrame({'sent': ["Dear members of Frontwave Credit Union, any credit demanded by Lewis Hamilton is invalid, said Ronaldo"]})
home = sent['sent'].str.extractall(f'({FINANCE})')

def labeler(row, group):
    l = len(row.split())
    return [f'I-{group}' if i !=0 else f'B-{group}' for i in range(l)]

home[0].apply(labeler, group='FINANCE').explode()

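For the sample sentence this returns a Series along the lines of (reconstructed; the original answer showed it as a screenshot):

   match
0  0        B-FINANCE
   0        I-FINANCE
   0        I-FINANCE
Name: 0, dtype: object

Note that this labels only the tokens inside matched entities; to get the full per-token table with O tags, the matches still have to be aligned back to the tokenized sentence, as in the PhraseMatcher sketch above.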
