Removing names from a data frame using spaCy in Python 3.9

Posted on 2025-01-11 16:43:10

I am working with the spaCy package v3.2.1 in Python 3.9 and want to understand how to use it to remove names from a data frame. I followed the spaCy documentation and can identify names correctly, but I do not understand how to remove them. My goal is to remove all names from a specific column of the data frame.

Actual

ID    Comment
A123  I am five years old, and my name is John
X907  Today I met with Dr. Jacob

What I am trying to accomplish

ID    Comment
A123  I am five years old, and my name is
X907  Today I met with Dr.

Code:

# Load packages
import spacy
import pandas as pd
from spacy import displacy


# Load the CSV
df = pd.read_csv('names.csv')

# Load the large spaCy model
nlp = spacy.load("en_core_web_lg")

# Check whether the large model identifies the named entities
df['test_col'] = df['Comment'].apply(lambda x: list(nlp(x).ents))

What my code does

ID    Comment                                   test_col
A123  I am five years old, and my name is John  [(John)]
X907  Today I met with Dr. Jacob                [(Jacob)]

But how do I go about removing those names from the Comment column? I think I need some sort of function that iterates over each row of the data frame and removes the identified entities. I would appreciate your help.

Thank you

Comments (2)

谜泪 2025-01-18 16:43:10

You can use

import spacy
import pandas as pd

# Test dataframe
df = pd.DataFrame({'ID':['A123','X907'], 'Comment':['I am five years old, and my name is John', 'Today I met with Dr. Jacob']})

# Initialize the model
nlp = spacy.load('en_core_web_trf')

def remove_names(text):
    doc = nlp(text)
    newString = text
    for e in reversed(doc.ents):
        if e.label_ == "PERSON": # Only if the entity is a PERSON
            newString = newString[:e.start_char] + newString[e.start_char + len(e.text):]
    return newString

df['Comment'] = df['Comment'].apply(remove_names)
print(df.to_string())

Output:

     ID                               Comment
0  A123  I am five years old, and my name is
1  X907                 Today I met with Dr.
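To show why the loop above walks the entities in reverse, here is a minimal pure-Python sketch of the same offset-based deletion. It needs no model: the `remove_spans` helper and the hard-coded offsets are hypothetical stand-ins for the `(ent.start_char, ent.end_char)` pairs that spaCy would supply, and the whitespace cleanup at the end is an extra step that is not part of the answer.

```python
import re


def remove_spans(text, spans):
    """Delete the (start, end) character ranges in `spans` from `text`."""
    # Delete from right to left so that the offsets of the spans
    # further left are still valid after each deletion.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    # Assumption: we also want to collapse the doubled spaces that
    # deletions leave behind and trim the ends.
    return re.sub(r"\s{2,}", " ", text).strip()


# Hypothetical offsets standing in for PERSON entities found by spaCy.
comment = "John met Dr. Jacob today"
spans = [(0, 4), (13, 18)]
cleaned = remove_spans(comment, spans)
print(cleaned)
```

With a real model, the spans could presumably be built as `[(e.start_char, e.end_char) for e in nlp(text).ents if e.label_ == "PERSON"]`. Deleting left to right instead would shift every later offset by the length of the text already removed.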

︶葆Ⅱㄣ 2025-01-18 16:43:10

Here's an idea using the string replace method:

EDIT: Stripping parens off to see if that helps.

# str(doc.ents) for a single entity looks like "(John,)", so strip the parentheses and the trailing comma
df['test_col'] = df['Comment'].apply(lambda x: str(x).replace(str(nlp(x).ents).strip('(),'), ''))

I cast the variables to str to help with the match, since I am not sure whether they are already strings. You may need to use an index, and loop if multiple names are found in a single comment, but that's the gist of it.
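The looping variant mentioned above might look roughly like the sketch below. It is not the answerer's code: `strip_names` is a hypothetical helper, and the list of names is hard-coded here, where in practice it would come from something like `[ent.text for ent in nlp(text).ents if ent.label_ == "PERSON"]`.

```python
def strip_names(text, names):
    """Remove every occurrence of each name in `names` from `text`."""
    # Handle longer names first so that "John Smith" is removed as a
    # whole before the shorter "John" gets a chance to split it.
    for name in sorted(set(names), key=len, reverse=True):
        text = text.replace(name, "")
    # Normalize the whitespace left behind by the removals.
    return " ".join(text.split())


# Hypothetical names standing in for the entities spaCy would return.
comment = "Today I met with Dr. Jacob and John Smith"
result = strip_names(comment, ["John", "John Smith", "Jacob"])
print(result)
```

One caveat with this approach: `str.replace` removes every matching substring, so a name that also appears inside another word would be cut out of that word as well. Deleting by character offsets avoids that problem.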
