Python runtime 3x deviation for 32 vs 34 character IDs
I am running an aggregation script which relies heavily on aggregating / grouping on an identifier column. Each identifier in this column is 32 characters long, as a result of a hashing function. So the ID column used in the pandas groupby has entries like

e667sad2345...1238a
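For concreteness, a minimal sketch of this kind of setup; the column names ("key", "value"), the sizes, and the use of uuid4 hex digests as stand-ins for the real hash values are all my own assumptions:

```python
import uuid

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_ids, n_rows = 100_000, 5_000_000          # illustrative sizes only

# 32-character hex digests standing in for the real hashed identifiers
ids = np.array([uuid.uuid4().hex for _ in range(n_ids)])

df = pd.DataFrame({
    "key": rng.choice(ids, size=n_rows),    # the 32-char ID column
    "value": rng.random(n_rows),
})

# The kind of aggregation the script relies on
result = df.groupby("key")["value"].sum()
```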
I added a prefix "ID" to some of the samples for easier separation afterwards. Thus, some identifiers now have 34 characters while others still have 32:

e667sad2345...1238a
IDf7901ase323...1344b
Now the aggregation script takes three times as long (6000 vs. 2000 seconds), and the change in the ID column (adding the prefix) is the only thing that changed. Also note that I generate the data separately and save it as a pickle file, which my aggregation script reads as input, so the prefix addition is not part of the runtime I am talking about.
So now I am stunned as to why this particular change made such a huge impact. Can someone elaborate?
EDIT: I replaced the prefix with a suffix, so now the entries look like
e667sad2345...1238a
f7901ase323...1344bID
and it runs in 2000 seconds again. Does groupby use a binary search or something similar, so that it suffers when the starting character 'I' is overrepresented among the IDs?
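Below is a self-contained timing sketch comparing all three variants discussed above (plain 32-char keys, an "ID" prefix on ~20% of keys, and an "ID" suffix on ~20% of keys). It is only a sketch under assumed sizes and key distributions, not the original workload:

```python
import time
import uuid

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_ids, n_rows = 100_000, 5_000_000

ids = np.array([uuid.uuid4().hex for _ in range(n_ids)])    # 32 chars each
mask = rng.random(n_ids) < 0.2                              # ~20% get "ID"

variants = {
    "plain":  ids,
    "prefix": np.where(mask, np.char.add("ID", ids), ids),  # 34 chars, leading "ID"
    "suffix": np.where(mask, np.char.add(ids, "ID"), ids),  # 34 chars, trailing "ID"
}

for name, keys in variants.items():
    df = pd.DataFrame({
        "key": rng.choice(keys, size=n_rows),
        "value": rng.random(n_rows),
    })
    t0 = time.perf_counter()
    df.groupby("key")["value"].sum()
    print(f"{name}: {time.perf_counter() - t0:.1f}s")
```

If the prefix variant alone stands out here as well, that would confirm the effect is reproducible outside the original script.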
Answer:
OK, I had a revelation about what is going on.

My entries are sorted using quicksort, which has an expected runtime of O(n log n). In the worst case, however, quicksort runs in O(n^2). By making my entries imbalanced (20% of the data starts with "I", the other 80% is randomly distributed over the alphanumeric characters), I pushed the data towards a bad case for quicksort.
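One way to probe this hypothesis outside of pandas is to time a plain quicksort on the two key distributions directly. A minimal sketch, assuming NumPy's default quicksort is a reasonable stand-in for whatever sort the groupby performs internally (which is itself an assumption):

```python
import time
import uuid

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000                                 # illustrative size

# Hex digests: first character roughly uniform over 0-9a-f
plain = np.array([uuid.uuid4().hex for _ in range(n)])

# Skewed variant: ~20% of entries share the leading "ID" prefix
mask = rng.random(n) < 0.2
skewed = np.where(mask, np.char.add("ID", plain), plain)

for name, keys in [("uniform first char", plain), ("20% 'ID' prefix", skewed)]:
    t0 = time.perf_counter()
    np.sort(keys, kind="quicksort")           # NumPy's default sort
    print(f"{name}: {time.perf_counter() - t0:.2f}s")
```

Note that NumPy's "quicksort" is actually an introsort that falls back to heapsort on pathological inputs, so this probe can show the skew slowing the sort down without ever hitting a true O(n^2) worst case.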