How to label multi-word entities?
I'm quite new to data analysis (and Python in general), and I'm currently a bit stuck in my project.
For my NLP task I need to create training data, i.e. find specific entities in sentences and label them. I have multiple CSV files containing the entities I am trying to find, many of them consisting of multiple words. I have tokenized and lemmatized the unlabeled sentences with spaCy and loaded them into a pandas.DataFrame.
My main problem is: how do I now compare the tokenized sentences with the entity-lists and label the (often multi-word) entities? Having around 0.5 GB of sentences, I don't think it is feasible to just for-loop every sentence and then for-loop every entity in every class-list and do a simple substring-search. Is there any smart way to use pandas.Series or DataFrame to do this labeling?
As mentioned, I don't really have any experience with pandas/numpy etc., and after a lot of web searching I still haven't found an answer to my problem.
Say that this is a sample of finance.csv, one of my entity lists:
"Frontwave Credit Union",
"St. Mary's Bank",
"Center for Financial Services Innovation",
...
And that this is a sample of sport.csv, another one of my entity lists:
"Christiano Ronaldo",
"Lewis Hamilton",
...
And an example (dumb) sentence:
"Dear members of Frontwave Credit Union, any credit demanded by Lewis Hamilton is invalid, said Ronaldo"
The result I'd like would be something like a table of tokens with the matching entity labels (with IOB labeling):
"Dear" - O
"members" - O
"of" - O
"Frontwave" - B-FINANCE
"Credit" - I-FINANCE
"Union" - I-FINANCE
"," - O
"any" - O
...
"Lewis" - B-SPORT
"Hamilton" - I-SPORT
...
"said" - O
"Ronaldo" - O
Use:
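The code that belonged here is not part of this copy. Since the sentences are already processed with spaCy, one plausible option is spaCy's `PhraseMatcher`, sketched below under a few assumptions: spaCy v3 is installed, a blank English pipeline (tokenizer only) is sufficient, and each CSV file has already been read into a plain list of strings. The helper name `iob_tags` is hypothetical, and this is not necessarily what the answer originally showed.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

# Blank pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")

# Assumption: each CSV file has been loaded into a list of entity names.
entity_lists = {
    "FINANCE": ["Frontwave Credit Union", "St. Mary's Bank",
                "Center for Financial Services Innovation"],
    "SPORT": ["Christiano Ronaldo", "Lewis Hamilton"],
}

# One pattern group per entity list; the key becomes the label.
matcher = PhraseMatcher(nlp.vocab)
for label, names in entity_lists.items():
    matcher.add(label, [nlp.make_doc(name) for name in names])

def iob_tags(doc):
    """Return one IOB tag per token of a spaCy Doc."""
    tags = ["O"] * len(doc)
    # filter_spans keeps the longest span when matches overlap.
    for span in filter_spans(matcher(doc, as_spans=True)):
        tags[span.start] = "B-" + span.label_
        for i in range(span.start + 1, span.end):
            tags[i] = "I-" + span.label_
    return tags

doc = nlp("Dear members of Frontwave Credit Union, any credit demanded "
          "by Lewis Hamilton is invalid, said Ronaldo")
tags = iob_tags(doc)
for token, tag in zip(doc, tags):
    print(f'"{token.text}" - {tag}')
```

For a large corpus, the sentences can be streamed through `nlp.pipe(sentences)` and tagged one `Doc` at a time. Matching is case-sensitive by default; passing `attr="LOWER"` to `PhraseMatcher` makes it case-insensitive.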