How to extract hashtags from tweets in a pandas DataFrame?

Posted 2025-01-13 08:15:38

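I have a dataset of tweets with several variables (columns), and I want to extract all the hashtags from each tweet (text) and place the result in a new column (hashtags). Below is what I am trying: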

import pandas as pd
data = pd.read_csv("Sample.csv", lineterminator='\n')

def hashtags(string):
    Hash = data.text.str.findall(r'#.*?(?=\s|$)')
    return Hash
data['hashtags'] = data['text'].apply(lambda x: hashtags(x))

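However, when I run the hashtags function, my notebook just hangs (it neither finishes executing nor raises an error). My file has only around 10k rows.

Also, if this code ran successfully, I would expect to get something like this: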

[#asd, #fer, #gtr]

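But I want the resulting column to contain only the hashtag names, like [asd, fer, gtr]. Please suggest what I should change in the code.

I looked for a solution in previously asked questions, but most of them use regular expressions, and I am looking for a solution that uses pandas.

Thanks in advance.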



童话里做英雄 2025-01-20 08:15:38

I downloaded some sample Twitter data as a .csv from https://twitter-sentiment-csv.herokuapp.com/. This example uses a slice of the first 10 rows.

import pandas as pd

def find_tags(row_string):
    # row_string is a list of tokens; keep the ones that start with '#'
    tags = [x for x in row_string if x.startswith('#')]

    return tags

df = pd.DataFrame({'sentiment': {0: 'neutral',
  1: 'neutral',
  2: 'neutral',
  3: 'neutral',
  4: 'neutral',
  5: 'neutral',
  6: 'neutral',
  7: 'positive',
  8: 'neutral',
  9: 'neutral'},
 'text': {0: 'RT @fakeTakeDump: TRAMS STELARA BICYCLE PINOCHLE JUMBO INDEX SEPTAVALENT TYPEWRITER HOMEBREWING AND ANTI-LOCK HULLO KITTY IN FORTUNE COOKIE…',
  1: 'RT @fauzanzain: Hi warga twitter, sekarang aku lagi cari career coach nih yang punya latar belakang tech recruiter / mid to senior digital…',
  2: 'RT @fakeTakeDump: WOODWORKING THE FORUM SHOPS LIKENESS SPECTROHELIOSCOPE CHEEMS FLAVONOIDS ROCKET IS NEITHER SUGAR DADDY CANNED TUNA HANDMA…',
  3: 'WOODWORKING THE FORUM SHOPS LIKENESS SPECTROHELIOSCOPE CHEEMS FLAVONOIDS ROCKET IS NEITHER SUGAR DADDY CANNED TUNA…',
  4: 'RT @KirkDBorne: Recap of 60 days of #DataScience and #MachineLearning — days 1 through 60: by @NainaChaturved8 \\n———…',
  5: 'Recap of 60 days of #DataScience and #MachineLearning — days 1 through 60:  by… ',
  6: 'RT @IBAConservative: @dax_christensen The truth is out! They can’t hold it back. \\n#CrimesAgainstHumanity \\n#TrudeauTyranny \\n#TrudeauMustResi…',
  7: "RT @drmwarner: As per these children's health organizations, keeping masks on in schools 2wks post March break would have made much more se…",
  8: 'RT @cryptotommy88: TL;DR\\n✅ Collective analytics business \\n✅ Draw power from data science & crowd-sourced knowledge\\n✅ 1st product PFPscore:…',
  9: 'RT @cryptotommy88: TL;DR\\n✅ Collective analytics business \\n✅ Draw power from data science & crowd-sourced knowledge\\n✅ 1st product PFPscore:…'},
 'user': {0: 'BotDuran',
  1: 'ezash',
  2: 'BlkHwk0ps',
  3: 'fakeTakeDump',
  4: 'RobotProud',
  5: 'KirkDBorne',
  6: 'cloudcnworld',
  7: 'NeuroTeck',
  8: 'BIGwinCutiejoy8',
  9: 'luckbigw1n'}})

df['split'] = df['text'].str.split(' ')

df['tags'] = df['split'].apply(lambda row : find_tags(row))
# Strip the '#' as the question asks, and clean up literal \n sequences, backslashes, and quotes.
# Note: str(x) turns each list into a plain string.
df['tags'] = df['tags'].apply(lambda x : str(x).replace('#', '').replace('\\n', ',').replace('\\', '').replace("'", ""))

Output of df['tags']:

0                                []
1                                []
2                                []
3                                []
4    [DataScience, MachineLearning]
5    [DataScience, MachineLearning]
6                                []
7                                []
8                                []
9                                []
Name: tags, dtype: object
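
As a side note on why the code in the question hangs: the hashtags function ignores its argument and calls data.text.str.findall on the whole column, and apply then repeats that for every row, so roughly 10k full-column scans are run and each cell would end up holding an entire Series. A minimal sketch of one possible alternative is below: call the pandas string method str.findall once on the column, with a capture group so the '#' is dropped. It still takes a regular expression, but the work is done by pandas. The sample rows are made up for illustration, and the pattern \w+ is an assumption about which characters count as part of a hashtag (letters, digits, underscore).

```python
import pandas as pd

# Hypothetical sample data; the column name 'text' follows the question.
data = pd.DataFrame({
    "text": [
        "Recap of 60 days of #DataScience and #MachineLearning",
        "no tags here",
        "The truth is out!\n#CrimesAgainstHumanity\n#TrudeauTyranny",
    ]
})

# One call over the whole column. The capture group (\w+) keeps only the
# name after '#', so each cell becomes a list of tag names without '#'.
data["hashtags"] = data["text"].str.findall(r"#(\w+)")

print(data["hashtags"].tolist())
```

Because the pattern looks for '#' anywhere in the text rather than at the start of a space-separated token, tags that directly follow a newline are also picked up, which the split-based approach above misses.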

