Summing duplicates in a list of dicts by composite key using itertools

Posted on 2025-01-11 02:46:16

I have a sorted list of dictionaries like so:

dat = [
      {"id1": 1, "id2": 2, "value": 1},
      {"id1": 1, "id2": 2, "value": 2},
      {"id1": 2, "id2": 2, "value": 2},
      {"id1": 2, "id2": 3, "value": 1},
      {"id1": 3, "id2": 3, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      ]

This is effectively a list of (id1, id2, value) tuples, but with duplicates. I would like to deduplicate these by summing the values where both ids are equal, leaving me with unique (id1, id2) pairs where the new value is the sum of the duplicates.

That is, from above, the desired output is:

dat = [
     {'id1': 1, 'id2': 2, 'value': 3},
     {'id1': 2, 'id2': 2, 'value': 2},
     {'id1': 2, 'id2': 3, 'value': 1},
     {'id1': 3, 'id2': 3, 'value': 1},
     {'id1': 3, 'id2': 4, 'value': 4}
     ]

Assume the list runs to millions of entries with lots of duplicates. What's the most efficient way to do this using itertools or funcy (versus, say, using pandas)?

3 Answers

南笙 2025-01-18 02:46:16

You can start with collections.Counter and use the += operator; the convenient part of Counter is that += treats missing keys as zero.

from collections import Counter

dat = [
      {"id1": 1, "id2": 2, "value": 1},
      {"id1": 1, "id2": 2, "value": 2},
      {"id1": 2, "id2": 2, "value": 2},
      {"id1": 2, "id2": 3, "value": 1},
      {"id1": 3, "id2": 3, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      ]

cnt = Counter()
for item in dat:
    # Missing (id1, id2) keys start at 0, so += accumulates directly.
    cnt[item["id1"], item["id2"]] += item["value"]

[{'id1': id1, 'id2': id2, 'value': v} for (id1, id2), v in cnt.items()]

Giving

[{'id1': 1, 'id2': 2, 'value': 3},
 {'id1': 2, 'id2': 2, 'value': 2},
 {'id1': 2, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 4, 'value': 4}]
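
To illustrate the missing-key behavior that makes the loop work: a lookup or += on a key the Counter has never seen behaves as if the count were 0.

from collections import Counter

c = Counter()
c['new'] += 5       # no KeyError: a missing key is treated as 0
print(c['new'])     # 5
print(c['absent'])  # 0 -- looking up a missing key returns 0 without inserting it
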
陪你到最终 2025-01-18 02:46:16

We could use collections.defaultdict as well:

from collections import defaultdict

tmp = defaultdict(int)
for d in dat:
    tmp[d['id1'], d['id2']] += d['value']
out = [{'id1': id1, 'id2': id2, 'value': v} for (id1, id2), v in tmp.items()]
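
Here defaultdict(int) plays the same role Counter does above: the int factory supplies 0 for first-seen keys, so += accumulates without any explicit key check.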

or (assuming the ids are sorted), itertools.groupby:

from itertools import groupby

out = [{'id1': k1, 'id2': k2, 'value': sum(d['value'] for d in g)}
       for (k1, k2), g in groupby(dat, lambda x: (x['id1'], x['id2']))]

or groupby + sum + to_dict in pandas:

import pandas as pd

out = pd.DataFrame(dat).groupby(['id1', 'id2'], as_index=False)['value'].sum().to_dict('records')

Output:

[{'id1': 1, 'id2': 2, 'value': 3},
 {'id1': 2, 'id2': 2, 'value': 2},
 {'id1': 2, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 4, 'value': 4}]


A basic benchmark on the provided data says groupby using itemgetter (as suggested by @ShadowRanger) is the fastest:

  1. defaultdict: 6.57 µs ± 491 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  2. Counter (Bob): 9.56 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  3. groupby + itemgetter (ShadowRanger): 6.01 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  4. groupby + lambda: 9.02 µs ± 598 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  5. pandas: 3.81 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Now, if we duplicate dat a million times, i.e.

from operator import itemgetter

dat = dat * 1_000_000
dat.sort(key=itemgetter('id1', 'id2'))

and run the same benchmark again, groupby with itemgetter is the runaway winner:

  1. 3.91 s ± 320 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  2. 5.38 s ± 251 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  3. 1.77 s ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  4. 3.53 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  5. 15.2 s ± 831 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

All timings on Python 3.9.7 (64-bit).

This benchmark somewhat favors groupby, since there are very few groups when we duplicate an existing small list of dicts. If we randomize the group sizes instead, groupby + itemgetter still beats the rest, but the difference is not as stark.
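
For reproducibility, a minimal sketch of timing one of these approaches with the standard timeit module (the figures above were presumably produced with IPython's %timeit; here dat is the list from the question, and exact numbers will vary by machine):

import timeit
from collections import Counter

def counter_approach(dat):
    cnt = Counter()
    for item in dat:
        cnt[item['id1'], item['id2']] += item['value']
    return [{'id1': id1, 'id2': id2, 'value': v}
            for (id1, id2), v in cnt.items()]

# Total seconds for 100,000 calls; divide by the count for per-loop time.
print(timeit.timeit(lambda: counter_approach(dat), number=100_000))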

谈场末日恋爱 2025-01-18 02:46:16

Just for fun, a purely itertools solution (no collections, and no intermediate containers that must be built and updated incrementally when the list is already in key order; it does require a pre-sort if you can't guarantee the data is already sorted so that identical id pairs are grouped together):

# At top of file
from itertools import groupby

# Also at top of file; not strictly necessary, but I find it's nicer to make cheap getters
# with self-documenting names
from operator import itemgetter
get_ids = itemgetter('id1', 'id2')
get_value = itemgetter('value')

# On each use:
dat.sort(key=get_ids)  # Not needed if data guaranteed grouped by unique id1/id2 pairs as in example

dat = [{'id1': id1, 'id2': id2, 'value': sum(map(get_value, group))}
       for (id1, id2), group in groupby(dat, key=get_ids)]

# If sorting is needed, you can optionally collapse it into one rather
# overly dense line (I don't recommend it):
dat = [{'id1': id1, 'id2': id2, 'value': sum(map(get_value, group))}
       for (id1, id2), group in groupby(sorted(dat, key=get_ids), key=get_ids)]
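
Applied to the sample dat from the question, either form produces the same five merged rows as the Counter and defaultdict answers.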

Personally, I'd generally use Counter or defaultdict(int) as shown in the other answers, since they get O(n) performance even with unsorted data (groupby is O(n), but if you need to sort first, the sort is O(n log n)). Basically, the only time this approach has even a theoretical advantage is when the data is already sorted and you value using a one-liner (excluding imports and the one-time setup cost of making the itemgetters); in practice, itertools.groupby has enough overhead that it still typically loses to one or both of collections.Counter and collections.defaultdict(int), especially when collections.Counter is used in its optimized mode for directly counting an iterable of hashable things (which doesn't apply here, but is worth knowing about).
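
For reference, a small sketch of that Counter fast path (it counts occurrences of hashable keys, so it answers "how many duplicate rows per pair" rather than summing their values):

from collections import Counter
from operator import itemgetter

get_ids = itemgetter('id1', 'id2')

# Feeding Counter an iterable of hashable keys uses its C-accelerated
# counting loop; this tallies how many rows share each (id1, id2) pair.
pair_counts = Counter(map(get_ids, dat))
# With the sample data: Counter({(3, 4): 4, (1, 2): 2, (2, 2): 1, (2, 3): 1, (3, 3): 1})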
