Extract SpaCy DATE entities and add them to a new pandas column
I have a collection of social media comments that I want to explore based on their references to dates. For this purpose, I am using SpaCy's Named Entity Recognizer to search for DATE entities. I have the comments in a pandas dataframe called df_test under the column comment. I would like to add a new column dates to this dataframe consisting of all the date entities found in each comment. Some comments will not contain any date entities, in which case None should be added instead.
So for example:
comment
'bla bla 21st century'
'bla 1999 bla bla 2022'
'bla bla bla'
Should be:
comment                   dates
'bla bla 21st century'    '21st century'
'bla 1999 bla bla 2022'   '1999', '2022'
'bla bla bla'             'None'
Based on the question "Is their a way to add the new NER tag found in a new column?", I have tried a list approach:
date_label = ['DATE']
dates_list = []

def get_dates(row):
    comment = str(df_test.comment.tolist())
    doc = nlp(comment)
    for ent in doc.ents:
        if ent.label_ in date_label:
            dates_list.append([ent.text])
        else:
            dates_list.append(['None'])

df_test.apply(lambda row: get_dates(row))
date_df_test = pd.DataFrame(dates_list, columns=['dates'])
However, this produces a column that is longer than the original dataframe, like:
comment                   dates
'bla bla 21st century'    '21st century'
'bla 1999 bla bla 2022'   '1999'
'bla bla bla'             '2022'
                          'None'
This doesn't work, since the entries of dates no longer match their corresponding comments. I understand that this is because I am looping over all entities, but I don't know how to work around it. Is there any way to solve this, so that I can extract all date entities and connect them in some way to the comment they were found in, for the purpose of later analysis? Any help is much appreciated!
Comments (1)
I managed to find a solution to my own problem by using this function.
I hope this may help anyone who faces a similar issue.
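The function mentioned above is not shown here. As a rough sketch of one way to approach it (not necessarily the function referred to above), the key idea is to process one comment at a time and return a single value per row, so the result always lines up with the original dataframe. The names extract_dates and add_dates_column are hypothetical, and the sketch assumes nlp is a loaded SpaCy pipeline whose NER emits DATE labels (for example en_core_web_sm) and that df_test has a comment column as in the question.

```python
def extract_dates(doc):
    """Collect the text of every DATE entity in one parsed comment.

    `doc` is expected to be a spaCy Doc (anything exposing `.ents` whose
    items have `.text` and `.label_` works). Returns a list of strings,
    or None when the comment contains no DATE entity.
    """
    dates = [ent.text for ent in doc.ents if ent.label_ == "DATE"]
    return dates if dates else None


def add_dates_column(df, nlp, text_col="comment", out_col="dates"):
    """Return a copy of `df` with one list of dates (or None) per row.

    Hypothetical helper: `nlp` is assumed to be a loaded spaCy pipeline.
    Each row is parsed on its own, so exactly one value is produced per
    comment and the new column has the same length as the dataframe.
    """
    result = df.copy()
    result[out_col] = result[text_col].astype(str).apply(
        lambda text: extract_dates(nlp(text))
    )
    return result


# Example usage (assumes spaCy, pandas and the en_core_web_sm model are installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   df_test = add_dates_column(df_test, nlp)
```

Compared with the attempt in the question, nothing is appended to a shared list, and the None fallback is decided once per comment rather than once per entity. Keeping the dates as a list (rather than a joined string) is a design choice made here so that later analysis can count or explode them; which entities actually get tagged as DATE depends on the model used.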