Reformatting data faster using groupby
So I have a DataFrame that looks something along these lines:
import pandas as pd
ddd = {
    'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'b': [22, 25, 18, 53, 19, 8, 75, 11, 49, 64],
    'c': [1, 1, 1, 2, 2, 3, 4, 4, 4, 5]
}
df = pd.DataFrame(ddd)
What I need is to group the data by the 'c' column and apply some data transformations. At the moment I'm doing this:
def do_stuff(d: pd.DataFrame):
    if d.shape[0] >= 2:
        return pd.DataFrame(
            {
                'start': [d.a.values[0]],
                'end': [d.a.values[d.shape[0] - 1]],
                'foo': [d.a.sum()],
                'bar': [d.b.mean()]
            }
        )
    else:
        return pd.DataFrame()

r = df.groupby('c').apply(lambda x: do_stuff(x))
Which gives the correct result:
     start  end   foo        bar
c
1 0    1.0  3.0   6.0  21.666667
2 0    4.0  5.0   9.0  36.000000
4 0    7.0  9.0  24.0  45.000000
The problem is that this approach is too slow. On my actual data it takes around 0.7 seconds, which is too long; ideally it needs to be much faster.
Is there any way I can do this faster? Or is there some other, faster method not involving groupby that I could use?
Comments (1)
We could first filter df for the "c" values that appear 2 or more times, then use groupby + named aggregation:
You could also do:
or
Output:
Some benchmarks:
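As a minimal sketch of the filter-then-aggregate idea described at the start of this answer (not necessarily the answerer's exact code): count the rows per 'c' value, keep only the values that appear at least twice, then compute all four columns in one groupby with named aggregation. The helper name summarize is hypothetical, and the sketch assumes column 'a' has no missing values, since the 'first' and 'last' aggregations skip NaN whereas the original .values[0] lookup does not.

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'b': [22, 25, 18, 53, 19, 8, 75, 11, 49, 64],
    'c': [1, 1, 1, 2, 2, 3, 4, 4, 4, 5],
})

def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-'c' summary for groups with at least 2 rows."""
    # Count rows per 'c' value and keep only the values seen 2 or more times.
    counts = frame['c'].value_counts()
    keep = counts.index[counts >= 2]
    filtered = frame[frame['c'].isin(keep)]
    # Named aggregation: output_column=(input_column, aggregation).
    return filtered.groupby('c').agg(
        start=('a', 'first'),
        end=('a', 'last'),
        foo=('a', 'sum'),
        bar=('b', 'mean'),
    )

out = summarize(df)
print(out)

Compared with the apply-based version in the question, the result here is indexed by 'c' alone (no extra inner index level) and the integer columns keep their integer dtype. To check the speed difference on your own data, one option is to wrap each version in timeit.timeit with a fixed number of repetitions; no timings are claimed here.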