Fastest way to subset - data.table vs MySQL
I'm an R user, and I frequently find that I need to write functions that require subsetting large datasets (10s of millions of rows). When I apply such functions over a large number of observations, it can get very time consuming if I'm not careful about how I implement it.
To speed this up, I have sometimes used the data.table package, which provides much faster subsetting than data frames. Recently, I've started experimenting with packages like RMySQL, pushing some tables to MySQL and using the package to run SQL queries and return the results.
I have found mixed performance improvements. For smaller datasets (millions of rows), loading the data into a data.table and setting the right keys seems to make subsetting faster. For larger datasets (10s to 100s of millions), sending a query out to MySQL appears to be faster.
I was wondering whether anyone has insight into which technique should return simple subsetting or aggregation queries faster, and whether this should depend on the size of the data. I understand that setting keys in data.table is somewhat analogous to creating an index, but I don't have much more intuition beyond that.
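For concreteness, here is a minimal sketch of the R-side comparison I have in mind, on toy data (the sizes and column names are invented; this is illustrative, not a benchmark):

    library(data.table)

    set.seed(1)
    n  <- 1e7                                   # on the order of 10s of millions of rows
    df <- data.frame(id    = sample(1e5L, n, replace = TRUE),
                     value = rnorm(n))

    # Plain data frame subsetting: a vector scan over all n rows
    system.time(res_df <- df[df$id == 42L, ])

    # Keyed data.table subsetting: setkey() sorts by id, after which
    # the lookup is a binary search instead of a full scan
    dt <- as.data.table(df)
    setkey(dt, id)
    system.time(res_dt <- dt[J(42L)])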
Answers (2)
If the data fits in RAM, data.table is faster. If you provide an example it will probably become evident, quickly, that you're using data.table badly. Have you read the "do's and don'ts" on the data.table wiki?
SQL has a lower bound because it is a row store. If the data fits in RAM (and 64-bit allows quite a lot) then data.table is faster not just because it is in RAM but because columns are contiguous in memory (minimising page fetches from RAM to L2 for column operations). Use data.table correctly and it should be faster than SQL's lower bound. This is explained in FAQ 3.1. If you're seeing slower performance with data.table, then chances are very high that you're using it incorrectly (or there's a performance bug that we need to fix). So, please post some tests after reading the data.table wiki.
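As a rough illustration of the kind of keyed usage this answer has in mind (the data and column names here are invented, not from the question):

    library(data.table)

    dt <- data.table(id    = sample(1e5L, 1e7, replace = TRUE),
                     value = rnorm(1e7))
    setkey(dt, id)    # sorts by id; subsequent key lookups are binary searches

    # Keyed lookup of several groups at once (binary search, not a vector scan)
    subset_many <- dt[J(c(1L, 42L, 99999L))]

    # Grouped aggregation in a single call; mean(value) is computed per id
    # over the contiguous, key-ordered column in memory
    agg <- dt[, list(mean_value = mean(value)), by = id]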
I am not an R user, but I know a little about databases. I believe that MySQL (or any other reputable RDBMS) will actually perform your subsetting operations faster (usually by an order of magnitude), barring any additional computation involved in the subsetting process.
I suspect your performance lag on small data sets is related to the expense of the connection and initial push of the data to MySQL. There is likely a point at which the connection overhead and data transfer time adds more to the cost of your operation than MySQL is saving you.
However, for datasets larger than a certain minimum, it seem likley that this cost is compensated for by the sheer speed of the database.
My understanding is that SQL can acheive most fetching and sorting operations much, much more quickly than iterative operations in code. But one must factor in the cost of the connection and (in this case) the initial transfer of data over the network wire.
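A rough sketch of the round trip being described, using the RMySQL interface mentioned in the question; the connection details, table name, and columns below are all placeholders:

    library(DBI)
    library(RMySQL)

    # Toy data standing in for the R-side table
    df <- data.frame(id    = sample(1e5L, 1e6, replace = TRUE),
                     value = rnorm(1e6))

    # Connection details are placeholders
    con <- dbConnect(MySQL(), dbname = "test", host = "localhost",
                     user = "user", password = "password")

    # The one-time cost discussed above: pushing the table over the wire,
    # then indexing the column you subset on
    dbWriteTable(con, "big_table", df, overwrite = TRUE, row.names = FALSE)
    dbGetQuery(con, "CREATE INDEX idx_id ON big_table (id)")

    # After that, the work happens server-side; only the (small)
    # result set comes back over the connection
    res <- dbGetQuery(con,
        "SELECT id, AVG(value) AS mean_value FROM big_table GROUP BY id")

    dbDisconnect(con)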
I will be interested to hear what others have to say...