Python runtime 3x deviation for 32 vs 34 character IDs
I am running an aggregation script which relies heavily on aggregating / grouping on an identifier column. Each identifier in this column is 32 characters long, as a result of a hashing function. So the ID column used in the pandas groupby has entries like

e667sad2345...1238a
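For concreteness, a minimal sketch of this kind of setup; the column names ("key", "value"), the sizes, and the use of uuid4 hex digests as stand-ins for the real hash values are all my own assumptions:

```python
import uuid

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_ids, n_rows = 100_000, 5_000_000          # illustrative sizes only

# 32-character hex digests standing in for the real hashed identifiers
ids = np.array([uuid.uuid4().hex for _ in range(n_ids)])

df = pd.DataFrame({
    "key": rng.choice(ids, size=n_rows),    # the 32-char ID column
    "value": rng.random(n_rows),
})

# The kind of aggregation the script relies on
result = df.groupby("key")["value"].sum()
```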
I added a prefix "ID" to some of the samples for easier separation afterwards. Thus, some identifiers now have 34 characters while others still have 32:

e667sad2345...1238a
IDf7901ase323...1344b
Now the aggregation script takes three times as long (6000 vs. 2000 seconds), and the change in the ID column (adding the prefix) is the only thing that changed. Also note that I generate the data separately and save it as a pickle file, which my aggregation script reads as input, so the prefix addition is not part of the runtime I am talking about.
So now I am stunned as to why this particular change made such a huge impact. Can someone elaborate?
EDIT: I replaced the prefix with a suffix, so now the entries look like
e667sad2345...1238a
f7901ase323...1344bID
and it runs in 2000 seconds again. Does groupby use a binary search or something similar, so that it suffers when the starting character 'I' is overrepresented among the IDs?
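Below is a self-contained timing sketch comparing all three variants discussed above (plain 32-char keys, an "ID" prefix on ~20% of keys, and an "ID" suffix on ~20% of keys). It is only a sketch under assumed sizes and key distributions, not the original workload:

```python
import time
import uuid

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_ids, n_rows = 100_000, 5_000_000

ids = np.array([uuid.uuid4().hex for _ in range(n_ids)])    # 32 chars each
mask = rng.random(n_ids) < 0.2                              # ~20% get "ID"

variants = {
    "plain":  ids,
    "prefix": np.where(mask, np.char.add("ID", ids), ids),  # 34 chars, leading "ID"
    "suffix": np.where(mask, np.char.add(ids, "ID"), ids),  # 34 chars, trailing "ID"
}

for name, keys in variants.items():
    df = pd.DataFrame({
        "key": rng.choice(keys, size=n_rows),
        "value": rng.random(n_rows),
    })
    t0 = time.perf_counter()
    df.groupby("key")["value"].sum()
    print(f"{name}: {time.perf_counter() - t0:.1f}s")
```

If the prefix variant alone stands out here as well, that would confirm the effect is reproducible outside the original script.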
Answer:
OK, I had a revelation about what is going on.

My entries are sorted using quicksort, which has an expected runtime of O(n log n). In the worst case, however, quicksort runs in O(n^2). By making my entries imbalanced (20% of the data starts with "I", the other 80% is randomly distributed over the alphanumeric characters), I pushed the data towards a bad case for quicksort.
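One way to probe this hypothesis outside of pandas is to time a plain quicksort on the two key distributions directly. A minimal sketch, assuming NumPy's default quicksort is a reasonable stand-in for whatever sort the groupby performs internally (which is itself an assumption):

```python
import time
import uuid

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000                                 # illustrative size

# Hex digests: first character roughly uniform over 0-9a-f
plain = np.array([uuid.uuid4().hex for _ in range(n)])

# Skewed variant: ~20% of entries share the leading "ID" prefix
mask = rng.random(n) < 0.2
skewed = np.where(mask, np.char.add("ID", plain), plain)

for name, keys in [("uniform first char", plain), ("20% 'ID' prefix", skewed)]:
    t0 = time.perf_counter()
    np.sort(keys, kind="quicksort")           # NumPy's default sort
    print(f"{name}: {time.perf_counter() - t0:.2f}s")
```

Note that NumPy's "quicksort" is actually an introsort that falls back to heapsort on pathological inputs, so this probe can show the skew slowing the sort down without ever hitting a true O(n^2) worst case.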