What is the performance difference between DBI's fetchall_hashref and fetchall_arrayref?
I am writing some Perl scripts to manipulate large amounts (in total about 42 million rows, but it won't be done in one hit) of data in two PostgreSQL databases.
For some of my queries it makes good sense to use fetchall_hashref because I have synthetic keys. However, in other instances, I'm going to have to use an array of three columns as the unique key.
This has got me wondering about performance differences between fetchall_arrayref and fetchall_hashref. I know that in both cases everything is going into memory, so selecting several GB of data probably isn't a good idea, but other than that there appears to be very little guidance in the documentation when it comes to performance.
My googling has been unsuccessful so if anyone can point me in the direction of some general performance studies I'd be grateful.
(I know I could benchmark this myself but unfortunately for dev purposes I don't have access to a machine which has identical hardware to production which is why I'm looking for general guidelines or even best practices).
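For the three-column case, it may be worth noting that fetchall_hashref also accepts a reference to an array of key field names, returning a tree of nested hashes keyed on each column in turn. A minimal sketch, assuming a hypothetical table and column names:

```perl
use strict;
use warnings;
use DBI;

# Hypothetical DSN, credentials, table and columns; adjust for your setup.
my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'pass',
                       { RaiseError => 1, AutoCommit => 1 });

my $sth = $dbh->prepare('SELECT col_a, col_b, col_c, payload FROM big_table');
$sth->execute;

# With an array ref of key fields, fetchall_hashref returns nested hashes:
# $tree->{$a}{$b}{$c} is the row hash for that composite key.
my $tree = $sth->fetchall_hashref([qw(col_a col_b col_c)]);

my $row = $tree->{'x'}{'y'}{'z'};   # hashref of field name => value, or undef
```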
Most of the choices between fetch methods depend on what format you want the data to end up in and how much of the work for that you want DBI to do for you.
My recollection is that iterating with fetchrow_arrayref and using bind_columns is the fastest (least DBI overhead) way to read through returned data.
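This matches the pattern in the DBI documentation: bind variables to the result columns once, then call fetch in a loop so DBI only has to update the bound variables in place. A sketch, with a made-up table and column names:

```perl
use strict;
use warnings;
use DBI;

# Hypothetical connection and query; substitute your own DSN and SQL.
my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'pass',
                       { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT id, name, value FROM big_table');
$sth->execute;

# Bind Perl variables to the result columns once; each fetch then just
# refreshes these variables, avoiding per-row copying into new structures.
my ($id, $name, $value);
$sth->bind_columns(\$id, \$name, \$value);

while ($sth->fetch) {    # fetch is an alias for fetchrow_arrayref
    # process $id, $name, $value here, one row at a time
}
```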
First question is whether you really need to use a fetchall in the first place. If you don't need all 42 million rows in memory at once, then don't read them all in at once! bind_columns and fetchrow_arrayref are generally the way to go whenever possible, as ysth already pointed out.
Assuming that fetchall really is needed, my gut intuition is that fetchall_arrayref will be marginally faster, since an array is a simpler data structure and doesn't need to compute hashes of the inserted keys, but the savings in time would be dwarfed by database read times, so it's unlikely to be significant.
Memory requirements are another matter entirely, though. The structure returned by fetchall_hashref is a hash of id => row, with each row represented as a hash of field name => field value. If you get 42 million rows, that means your list of field names is repeated in each of 42 million sets of hash keys... That's going to require a good deal more memory to store than the array of arrays returned by fetchall_arrayref. (Unless DBI is doing some magic with tie to optimize the fetchall_hashref structure, I suppose.)
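If the three-column composite key is the only reason to reach for fetchall_hashref, one memory-leaner alternative (a sketch, not something from the answers above) is to fetch rows as array refs and build a flat composite key yourself, for example by joining the columns with $; (Perl's subscript separator). The rows below are simulated in place of a real fetch loop:

```perl
use strict;
use warnings;

# Simulated result rows, as fetchrow_arrayref would return them:
# [col_a, col_b, col_c, payload]
my @rows = (
    [ 'x', 1, 'p', 'first'  ],
    [ 'y', 2, 'q', 'second' ],
);

# Build a flat hash keyed on the three columns joined with $;,
# storing only the payload rather than a per-row field-name hash.
my %by_key;
for my $row (@rows) {
    my ($ka, $kb, $kc, $payload) = @$row;
    $by_key{ join $;, $ka, $kb, $kc } = $payload;
}

print $by_key{ join $;, 'y', 2, 'q' }, "\n";   # prints "second"
```

Because each entry stores a single scalar instead of a hash of field names, this avoids repeating the column-name keys once per row.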