Extract SpaCy DATE entities and add them to a new pandas column

Posted on 2025-01-11 15:26:46

I have a collection of social media comments that I want to explore based on their references to dates. For this purpose, I am using SpaCy's Named Entity Recognizer to search for DATE entities. The comments are in a pandas dataframe called df_test, in the column comment. I would like to add a new column, dates, to this dataframe containing all the date entities found in each comment. Some comments will not contain any date entities, in which case None should be added instead.
For example:

comment
'bla bla 21st century'
'bla 1999 bla bla 2022'
'bla bla bla'

Should be:

comment                        dates
'bla bla 21st century'         '21st century'
'bla 1999 bla bla 2022'        '1999', '2022'
'bla bla bla'                  'None'

Based on the question "Is their a way to add the new NER tag found in a new column?", I have tried a list approach:

date_label = ['DATE']
dates_list = []

def get_dates(row):
    comment = str(df_test.comment.tolist())
    doc = nlp(comment)
    for ent in doc.ents:
        if ent.label_ in date_label:
            dates_list.append([ent.text])
        else:
            dates_list.append(['None'])

df_test.apply(lambda row: get_dates(row))
date_df_test = pd.DataFrame(dates_list, columns=['dates'])

However, this produces a column that is longer than the original dataframe, like:

comment                        dates
'bla bla 21st century'         '21st century'
'bla 1999 bla bla 2022'        '1999'
'bla bla bla'                  '2022'
                               'None'

This doesn't work, since the entries in dates no longer match their corresponding comments. I understand that this happens because I am looping over all entities, but I don't know how to work around it. Is there a way to extract all date entities and connect them to the comment they were found in, for later analysis? Any help is much appreciated!

秋凉 2025-01-18 15:26:46

I managed to solve my own problem with this function:

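date_label = ['DATE']

def extract_dates(text):
    # Run the pipeline on a single comment, so the results stay tied to that row.
    doc = nlp(text)
    # Keep (text, label) pairs for DATE entities; a comment with no dates gives an empty list.
    results = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in date_label]
    return results

# Series.apply calls extract_dates once per comment, so the new column has one entry per row.
df_test['dates'] = df_test['comment'].apply(extract_dates)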
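
If the column should hold only the entity text, with None where no date is found (as in the example in the question), a variation along these lines may work. It is a minimal sketch, not something tested on real comments: the helper names dates_from_doc and add_dates_column are made up for illustration, and nlp is assumed to be an already loaded spaCy pipeline with an NER component.

```python
DATE_LABELS = {"DATE"}


def dates_from_doc(doc, labels=DATE_LABELS):
    """Return the text of each matching entity in doc, or None if there is none."""
    found = [ent.text for ent in doc.ents if ent.label_ in labels]
    return found or None


def add_dates_column(df, nlp, text_col="comment", out_col="dates"):
    """Add one entry per comment so the new column stays aligned with the rows."""
    # nlp.pipe yields one Doc per input text, in the same order as the input,
    # so the resulting list has exactly one item per row.
    df[out_col] = [dates_from_doc(doc) for doc in nlp.pipe(df[text_col])]
    return df
```

It would be called as add_dates_column(df_test, nlp). The only real differences from the function above are that each row gets a plain list of strings (or None) instead of (text, label) tuples, and that nlp.pipe is used to process the comments as a batch rather than one call per row. Which spaCy model to load is left open here, since the question does not say which one was used.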

I hope this helps anyone who faces a similar issue.
