A performant way to count hashtags stored in lists (pandas)

Posted 2025-01-11 07:07:46


I have a dataframe with ~7,000,000 rows and a lot of columns.

Each row is a tweet, and I have a column `text` with the tweet's content.

I created a new column just for the hashtags inside the text:

df['hashtags'] = df.Tweets.str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')

So I have a column called hashtags, with each row containing a list such as ['#b747', '#test'].

I would like to count the occurrences of each hashtag, but I have a large number of rows. What is the most performant way to do it?
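A minimal, made-up illustration of what this extraction step produces (the sample tweets here are invented):

```python
import pandas as pd

# three toy tweets; one has no hashtag at all
df = pd.DataFrame({'Tweets': ['flying the #b747 today #test', 'no tags here', '#solo']})

# same regex as above: a '#' preceded by whitespace or start-of-string,
# captured lazily up to the next whitespace or end-of-string
df['hashtags'] = df.Tweets.str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')

print(df['hashtags'].tolist())  # [['#b747', '#test'], [], ['#solo']]
```

Rows without hashtags get an empty list, which matters later when aggregating.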

3 Answers

愿与i 2025-01-18 07:07:46


Here are some different approaches, along with timing, ordered by speed (fastest first):

# setup
from collections import Counter
from functools import reduce

import numpy as np
import pandas as pd

n = 10_000
df = pd.DataFrame({
    'hashtags': np.random.randint(0, int(np.sqrt(n)), (n, 10)).astype(str).tolist(),
})


# 1. using itertools.chain to build an iterator on the elements of the lists
from itertools import chain
%timeit Counter(chain(*df.hashtags))
# 7.35 ms ± 58.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# 2. as per @Psidom comment
%timeit df.hashtags.explode().value_counts()
# 8.06 ms ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# 3. using Counter constructor, but specifying an iterator, not a list
%timeit Counter(h for hl in df.hashtags for h in hl)
# 10.6 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# 4. iterating explicitly and using Counter().update()
def count5(s):
    c = Counter()
    for hl in s:
        c.update(hl)
    return c
%timeit count5(df.hashtags)
# 12.4 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# 5. using functools.reduce with Counter().update()
%timeit reduce(lambda x,y: x.update(y) or x, df.hashtags, Counter())
# 13.7 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# 6. as per @EzerK
%timeit Counter(sum(df['hashtags'].values, []))
# 2.58 s ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Conclusion: the fastest is #1 (using Counter(chain(*df.hashtags))), but the more intuitive and natural #2 (from @Psidom's comment) is almost as fast, and I would probably go with that. #6 (@EzerK's approach) is very slow for a large df because we build a new (long) list before passing it as an argument to Counter().
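As a quick sanity check that the two fastest approaches agree, here is a small sketch with made-up data:

```python
from collections import Counter
from itertools import chain

import pandas as pd

df = pd.DataFrame({'hashtags': [['#b747', '#test'], ['#b747'], []]})

# approach 1: stream every element of every list into Counter
counts_chain = Counter(chain(*df.hashtags))

# approach 2: one row per hashtag, then count; empty lists become NaN,
# which value_counts() drops by default
counts_explode = df.hashtags.explode().value_counts()

print(counts_chain)              # Counter({'#b747': 2, '#test': 1})
print(counts_explode.to_dict())  # {'#b747': 2, '#test': 1}
```

Both give the same tallies; `explode().value_counts()` returns a Series (handy for sorting or plotting), while `Counter` stays in plain Python.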

无人接听 2025-01-18 07:07:46


You can concatenate all the lists into one big list and then use collections.Counter:

import pandas as pd
from collections import Counter

df = pd.DataFrame()
df['hashtags'] = [['#b747', '#test'], ['#b747', '#test']]

Counter(sum(df['hashtags'].values, []))
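With the two rows above this yields Counter({'#b747': 2, '#test': 2}). One caveat worth noting: `sum(lists, [])` rebuilds the accumulator list at every step, so it scales quadratically with the total number of hashtags; `itertools.chain` streams the same elements without the intermediate copies. A small sketch of the two side by side:

```python
from collections import Counter
from itertools import chain

import pandas as pd

df = pd.DataFrame()
df['hashtags'] = [['#b747', '#test'], ['#b747', '#test']]

# sum() copies the growing accumulator on every addition (quadratic)
slow = Counter(sum(df['hashtags'].values, []))

# chain.from_iterable() yields elements lazily, no intermediate list
fast = Counter(chain.from_iterable(df['hashtags']))

assert slow == fast
print(slow)  # Counter({'#b747': 2, '#test': 2})
```

For a handful of rows the difference is invisible; for millions of rows it is the gap between milliseconds and seconds in the timings above.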
动听の歌 2025-01-18 07:07:46


I tried solutions 1, 2 and 3 from Pierre's answer, as they are the most performant and I wanted one or two lines of code. Also, I wanted the count of hashtags per tweet, i.e. per row, rather than a dictionary of all the distinct hashtags used in the dataset with their frequency counts. An up-vote from me for all the effort and the different methods Pierre suggests.

Unfortunately, solutions 1, 2 and 3 all returned NaNs for the Twitter data I work with. I also tried different variants of tweets_df.Tweets.str.extractall(r'(\#\w+)')[0].value_counts() suggested here, but none of them worked.

What worked in the end are these two lines:

%timeit df['hashtags'] = df['text'].str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')
15.2 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df['hashtagCount']  = df.hashtags.agg([len])
1.11 ms ± 73.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

On the note of performance, this code to extract the hashtags takes less time:

#alternative solution to extract hashtags and put in new column with apply
%timeit df['hashtags'] = df['text'].apply(lambda twt : re.findall(r"(#(?:[^\x00-\x7F]|\w)+)", twt))
3.68 ms ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
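For the per-row count, one detail worth knowing: `df.hashtags.agg([len])` returns a one-column DataFrame, whereas `.str.len()` (which also works on list-valued columns) gives a plain integer Series. A minimal sketch with invented tweets:

```python
import pandas as pd

df = pd.DataFrame({'text': ['flying the #b747 today #test', 'no tags', '#solo']})
df['hashtags'] = df['text'].str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')

# .str.len() counts the elements of each list and returns a plain Series
df['hashtagCount'] = df['hashtags'].str.len()

print(df['hashtagCount'].tolist())  # [2, 0, 1]
```

This gives one count per tweet, which matches what this answer was after, without the extra column level that `agg([len])` introduces.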