Using itertools to sum duplicates in a list of dicts by a composite key
I have a sorted list of dictionaries like so:
dat = [
{"id1": 1, "id2": 2, "value": 1},
{"id1": 1, "id2": 2, "value": 2},
{"id1": 2, "id2": 2, "value": 2},
{"id1": 2, "id2": 3, "value": 1},
{"id1": 3, "id2": 3, "value": 1},
{"id1": 3, "id2": 4, "value": 1},
{"id1": 3, "id2": 4, "value": 1},
{"id1": 3, "id2": 4, "value": 1},
{"id1": 3, "id2": 4, "value": 1},
]
This is effectively (id1, id2, value) tuples, but where there are duplicates. I would like to deduplicate these by summing the values where both ids are equal, leaving me with unique (id1, id2) pairs where the new value is the sum of the dupes.
That is, from above, the desired output is:
dat = [
{'id1': 1, 'id2': 2, 'value': 3},
{'id1': 2, 'id2': 2, 'value': 2},
{'id1': 2, 'id2': 3, 'value': 1},
{'id1': 3, 'id2': 3, 'value': 1},
{'id1': 3, 'id2': 4, 'value': 4}
]
Assume the list has millions of entries with lots of duplicates. What's the most efficient way to do this using itertools or funcy (versus, say, using pandas)?
You can start with collections.Counter and use the += operator; the convenient part of Counter is that += assumes zero for non-existent keys.
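For example, something along these lines:

from collections import Counter

# illustrative sketch of the Counter approach
# accumulate values keyed by the (id1, id2) pair; += starts from zero for a missing key
totals = Counter()
for d in dat:
    totals[d["id1"], d["id2"]] += d["value"]

# rebuild the list-of-dicts shape
result = [{"id1": id1, "id2": id2, "value": v} for (id1, id2), v in totals.items()]

This gives the desired deduplicated list.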
We could use collections.defaultdict as well:
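For instance (a sketch; the variable names are illustrative):

from collections import defaultdict

# defaultdict(int) also treats missing keys as zero
totals = defaultdict(int)
for d in dat:
    totals[d["id1"], d["id2"]] += d["value"]

result = [{"id1": id1, "id2": id2, "value": v} for (id1, id2), v in totals.items()]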
or (assuming the ids are sorted), itertools.groupby:
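For example (a sketch that relies on dat already being sorted by the two ids):

from itertools import groupby
from operator import itemgetter

# consecutive runs of equal (id1, id2) pairs are collapsed into one dict each
keyfunc = itemgetter("id1", "id2")
result = [
    {"id1": id1, "id2": id2, "value": sum(d["value"] for d in grp)}
    for (id1, id2), grp in groupby(dat, key=keyfunc)
]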
or groupby + sum + to_dict in pandas:
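Something like the following (a sketch, not necessarily the exact code that was benchmarked):

import pandas as pd

# group on the composite key, sum the values, and convert back to a list of dicts
result = (
    pd.DataFrame(dat)
    .groupby(["id1", "id2"], as_index=False)["value"]
    .sum()
    .to_dict("records")
)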
Output:
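All of the above reproduce the deduplicated list from the question (row order may differ slightly between approaches):

[{'id1': 1, 'id2': 2, 'value': 3},
 {'id1': 2, 'id2': 2, 'value': 2},
 {'id1': 2, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 4, 'value': 4}]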
A basic benchmark on the provided data says groupby using itemgetter (as suggested by @ShadowRanger) is the fastest:
6.57 µs ± 491 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.56 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
6.01 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.02 µs ± 598 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.81 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now if we duplicate dat 1 million times, i.e. do something along the lines of
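# illustrative; the original snippet isn't preserved, but simply repeating the small list works
dat = dat * 1_000_000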
and do the same benchmark again, groupby with itemgetter is the runaway winner:
3.91 s ± 320 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.38 s ± 251 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.77 s ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.53 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
15.2 s ± 831 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
All benchmarks were run on Python 3.9.7 (64-bit).
This benchmark somewhat favors groupby since there are very few groups when we duplicate an existing small list of dicts. If we instead randomize the sizes of the "groups", groupby + itemgetter still beats the rest, but the difference is not as stark.
Just for fun, a purely itertools solution (no use of collections or of any other intermediate containers that must be built and updated incrementally) if the list is already in key order, though it requires a pre-sort if you can't guarantee it's already sorted to group unique id pairs together:
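A sketch of what that might look like (reconstructed from the description above, not the original code):

from itertools import groupby
from operator import itemgetter

# one-time setup: getters for the composite key and for the value
keyfunc = itemgetter("id1", "id2")
getval = itemgetter("value")

# dat must already be sorted by (id1, id2) so that equal pairs are adjacent
result = [{"id1": id1, "id2": id2, "value": sum(map(getval, grp))}
          for (id1, id2), grp in groupby(dat, key=keyfunc)]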
Personally, I'd generally use Counter or defaultdict(int) as shown in the other answers, as they get O(n) performance even with unsorted data (groupby is O(n), but if you need to sort first, the sorting is O(n log n)). Basically, the only time this even has a theoretical advantage is when the data is already sorted and you value using a one-liner (excluding imports and the one-time setup cost to make the itemgetters); in practice, itertools.groupby has enough overhead that it still typically loses to one or both of collections.Counter / collections.defaultdict(int), especially when using collections.Counter in its optimized modes for counting iterables of things to count (which don't apply here, but are worth knowing about).