What language could I use for fast execution of this database summarization task?
So I wrote a Python program to handle a little data processing
task.
Here's a very brief specification in a made-up language of the computation I want:
parse "%s %lf %s" aa bb cc | group_by aa | quickselect --key=bb 0:5 | \
flatten | format "%s %lf %s" aa bb cc
That is, for each line, parse out a word, a floating-point number, and another word. Think of them as a player ID, a score, and a date. I want the top five scores and dates for each player. The data size is not trivial, but not huge; about 630 megabytes.
I want to know what real, executable language I should have written it in to
get it to be similarly short (as the Python below) but much faster.
#!/usr/bin/python
# -*- coding: utf-8; -*-
import sys
top_5 = {}
for line in sys.stdin:
    aa, bb, cc = line.split()
    # We want the top 5 for each distinct value of aa. There are
    # hundreds of thousands of values of aa.
    bb = float(bb)
    if aa not in top_5: top_5[aa] = []
    current = top_5[aa]
    current.append((bb, cc))
    # Every once in a while, we drop the values that are not in
    # the top 5, to keep our memory footprint down, because some
    # values of aa have thousands of (bb, cc) pairs.
    if len(current) > 10:
        current.sort()
        current[:-5] = []
for aa in top_5:
    current = top_5[aa]
    current.sort()
    for bb, cc in current[-5:]:
        print aa, bb, cc
Here’s some sample input data:
3 1.5 a
3 1.6 b
3 0.8 c
3 0.9 d
4 1.2 q
3 1.5 e
3 1.8 f
3 1.9 g
Here’s the output I get from it:
3 1.5 a
3 1.5 e
3 1.6 b
3 1.8 f
3 1.9 g
4 1.2 q
There are seven values for 3, and so we drop the c and d values because their bb value puts them out of the top 5. Because 4 has only one value, its “top 5” consists of just that one value.
This runs faster than doing the same queries in MySQL (at least, the
way we’ve found to do the queries) but I’m pretty sure it's spending
most of its time in the Python bytecode interpreter. I think that in
another language, I could probably get it to process hundreds of
thousands of rows per second instead of per minute. So I’d like to
write it in a language that has a faster implementation.
But I’m not sure what language to choose.
I haven’t been able to figure out how to express this as a single query in SQL, and
actually I’m really unimpressed with MySQL’s ability even to merely select * from foo into outfile 'bar';
the input data.
C is an obvious choice, but things like line.split(), sorting a list of 2-tuples, and making a hash table require writing some code that’s not in the standard library, so I would end up with 100 lines of code or more instead of 14.
C++ seems like it might be a better choice (it has strings, maps,
pairs, and vectors in the standard library) but it seems like the code
would be a lot messier with STL.
OCaml would be fine, but does it have an equivalent of line.split(), and will I be sad about the performance of its map?
Common Lisp might work?
Is there some equivalent of Matlab for database computation like this
that lets me push the loops down into fast code? Has anybody tried Pig?
(Edit: responded to davethegr8's comment by providing some sample input and output data, and fixed a bug in the Python program!)
(Additional edit: Wow, this comment thread is really excellent so far. Thanks, everybody!)
Edit:
There was an eerily similar question asked on sbcl-devel in 2007 (thanks, Rainer!), and here's an awk
script from Will Hartung for producing some test data (although it doesn't have the Zipfian distribution of the real data):
BEGIN {
for (i = 0; i < 27000000; i++) {
v = rand();
k = int(rand() * 100);
print k " " v " " i;
}
exit;
}
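The awk script above draws its keys uniformly; as a rough illustration of a heavier-tailed generator, here is a Python 3 sketch. The key count and the Pareto shape parameter are guesses for illustration, not values taken from the real data:

#!/usr/bin/env python3
# Illustrative only: Zipf-ish (heavy-tailed) player IDs, uniform scores.
# NUM_KEYS and the Pareto shape parameter are guesses, not real-data values.
import random

NUM_ROWS = 27000000
NUM_KEYS = 300000   # "hundreds of thousands of values of aa"

for i in range(NUM_ROWS):
    # paretovariate(1.0) is heavy-tailed, so small key ids occur far more often
    k = int(random.paretovariate(1.0)) % NUM_KEYS
    print(k, random.random(), i)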
I have a hard time believing that any script without any prior knowledge of the data (unlike MySql which has such info pre-loaded), would be faster than a SQL approach.
Aside from the time spent parsing the input, the script needs to "keep" sorting the order by array etc...
The following is a first guess at what should work decently fast in SQL, assuming an index (*) on the table's aa, bb, cc columns, in that order. (A possible alternative would be an "aa, bb DESC, cc" index.)
(*) This index could be clustered or not, without affecting the following query. The choice of clustering, and of whether a separate "aa, bb, cc" index is needed, depends on the use case, the size of the rows in the table, etc.
The idea is to get a count of how many records, within a given aa value, are smaller than self. There is a small trick however: we need to use a LEFT OUTER join, lest we discard the record with the biggest bb value or the last one (which may happen to be one of the top 5). As a result of the left join, the COUNT(*) value counts 1, 1, 2, 3, 4 etc., and the HAVING test therefore is "<5" to effectively pick the top 5.
To emulate the sample output of the OP, the ORDER BY uses DESC on the COUNT(), which could be removed to get a more traditional top 5 type of listing. Also, the COUNT() in the select list can be removed if so desired, this doesn't impact the logic of the query and the ability to properly sort.
Also note that this query is deterministic in terms of dealing with ties, i.e. when a given set of records has the same value for bb (within an aa group); I think the Python program may provide slightly different outputs when the order of the input data is changed, because of its occasional truncating of the sorting dictionary.
Real solution: A SQL-based procedural approach
The self-join approach described above demonstrates how declarative statements can be used to express the OP's requirement. However this approach is naive in the sense that its performance is roughly bound to the sum of the squares of the record counts within each aa 'category' (not O(n^2), but roughly O((n/a)^2), where a is the number of different values of the aa column). In other words it performs well with data such that, on average, the number of records associated with a given aa value doesn't exceed a few dozen. If the data is such that the aa column is not selective, the following approach is much -much!- better suited. It leverages SQL's efficient sorting framework, while implementing a simple algorithm that would be hard to express declaratively. This approach could be further improved for datasets where each/most aa 'categories' have a particularly large number of records, by introducing a simple binary search for the next aa value, looking ahead (and sometimes back...) in the cursor. For cases where the number of aa 'categories' relative to the overall row count in tblAbc is low, see yet another approach, after this next one.
Alternative to the above for cases when aa is very unselective. In other words, when we have relatively few aa 'categories'. The idea is to go through the list of distinct categories and to run a "LIMIT" (MySql) "TOP" (MSSQL) query for each of these values.
For reference purposes, the following ran in 63 seconds for tblAbc of 61 Million records divided in 45 aa values, on MSSQL 8.0, on a relatively old/weak host.
On the question of needing an index or not. (cf OP's remark)
When merely running a "SELECT * FROM myTable", a table scan is effectively the fastest appraoch, no need to bother with indexes. However, the main reason why SQL is typically better suited for this kind of things (aside from being the repository where the data has been accumulating in the first place, whereas any external solution needs to account for the time to export the relevant data), is that it can rely on indexes to avoid scanning. Many general purpose languages are far better suited to handle raw processing, but they are fighting an unfair battle with SQL because they need to rebuilt any prior knowledge of the data which SQL has gathered in the course of its data collection / import phase. Since sorting is a typically a time and sometimes space consuming task, SQL and its relatively slower processing power often ends up ahead of alternative solutions.
Also, even without pre-built indexes, modern query optimizers may decide on a plan that involves the creation of a temporary index. And, because sorting is an intrinsic part of DDMS, the SQL servers are generally efficient in that area.
So... Is SQL better?
This said, if we are trying to compare SQL and other languages for pure ETL jobs, i.e. for dealing with heaps (unindexed tables) as the input to perform various transformations and filtering, multi-threadable utilities written in, say, C, leveraging efficient sorting libraries, would likely be faster. The determining question in deciding on a SQL vs. non-SQL approach is where the data is located and where it should eventually reside. If we merely need to convert a file to be supplied down "the chain", external programs are better suited. If we have or need the data in a SQL server, there are only rare cases that make it worthwhile to export and process it externally.
You could use smarter data structures and still use python.
I've run your reference implementation and my Python implementation on my machine, and even compared the output to be sure of the results.
This is yours:
This is mine:
And this is the source:
Update: Know your limits.
I've also timed a noop code, to know the fastest possible python solution with code similar to the original:
And the noop.py itself:
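abbot's actual source accompanied the original answer but did not survive the copy here. As a hedged illustration of the "smarter data structures" idea (not abbot's code), a bounded min-heap per key, using Python 3 and the standard heapq module, replaces the original program's periodic sort-and-truncate:

#!/usr/bin/env python3
# Sketch only: keep a size-5 min-heap of (score, date) per player, so each
# row costs one dict lookup plus at most one heap operation.
import heapq
import sys

top_5 = {}
for line in sys.stdin:
    aa, bb, cc = line.split()
    bb = float(bb)
    heap = top_5.setdefault(aa, [])
    if len(heap) < 5:
        heapq.heappush(heap, (bb, cc))
    elif bb > heap[0][0]:
        # New score beats the current 5th-best for this key; replace it.
        heapq.heapreplace(heap, (bb, cc))

for aa, heap in top_5.items():
    for bb, cc in sorted(heap):
        print(aa, bb, cc)

Here heap[0] is always the smallest of the five kept scores, so most rows are rejected with a single comparison.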
This took 45.7s on my machine with 27M rows of data that looked like this:
Your script took 1m42 on this data, and the C++ example took 1m46 (g++ t.cpp -o t to compile it; I don't know anything about C++).
Java 6, not that it matters really. Output isn't perfect, but it's easy to fix.
AWK script to fake the data:
This is a sketch in Common Lisp.
Note that for long files there is a penalty for using READ-LINE, because it conses a fresh string for each line. Use one of the derivatives of READ-LINE that are floating around and that use a line buffer instead. You might also check whether you want the hash table to be case sensitive or not.
second version
Splitting the string is no longer needed, because we do it here. It is low level code, in the hope that some speed gains will be possible. It checks for one or more spaces as field delimiter and also tabs.
Above function returns two values: the key and a list of the rest.
some test data
; (create-test-data)
third version, LispWorks
Uses some SPLIT-STRING and PARSE-FLOAT functions, otherwise generic CL.
Here is one more OCaml version - targeted for speed - with custom parser on Streams. Too long, but parts of the parser are reusable. Thanks peufeu for triggering competition :)
Speed :
Compile with :
Code :
Of all the programs in this thread that I've tested so far, the OCaml version is the fastest and also among the shortest. (Line-of-code-based measurements are a little fuzzy, but it's not clearly longer than the Python version or the C or C++ versions, and it is clearly faster.)
Here are the timings for the different versions so far, running on a 27-million-row 630-megabyte input data file. I'm on Ubuntu Intrepid Ibex on a dual-core 1.6GHz Celeron, running a 32-bit version of the OS (the Ethernet driver was broken in the 64-bit version). I ran each program five times and report the range of times those five tries took. I'm using Python 2.5.2, OpenJDK 1.6.0.0, OCaml 3.10.2, GCC 4.3.2, SBCL 1.0.8.debian, and Octave 3.0.1.
Pig (apt-get install pig): 7 lines of code.
sort -n pipeline: 90 + 5.5 seconds (±3, ±0.1), but gives the wrong answer because of a deficiency in GNU sort, in 22 lines of code (including one line of shell).
noop.py, which intentionally gives incorrect results to establish a lower bound: not yet tested, 15 lines of code.
I suspect abbot's version comes out relatively worse for me than for them because the real dataset has a highly nonuniform distribution: as I said, some aa values (“players”) have thousands of lines, while others only have one.
About Psyco: I applied Psyco to my original code (and abbot's version) by putting it in a main function, which by itself cut the time down to about 140 seconds, and calling psyco.full() before calling main(). This added about four lines of code.
I can almost solve the problem using GNU sort, as follows:
Here top5_sorted_c is this short C program:
I first tried writing that program in C++ as follows, and I got runtimes which were substantially slower, at 33.6±2.3 seconds instead of 5.5±0.1 seconds:
I did say almost. The problem is that sort -n does okay for most of the data, but it fails when it's trying to compare 0.33 with 3.78168e-05. So to get this kind of performance and actually solve the problem, I need a better sort.
Anyway, I kind of feel like I'm whining, but the sort-and-filter approach is about 5× faster than the Python program, while the elegant STL program from hrnt is actually a little slower; there seems to be some kind of gross inefficiency in <iostream>. I don't know where the other 83% of the runtime is going in that little C++ version of the filter, but it isn't going anywhere useful, which makes me suspect I don't know where it's going in hrnt's std::map version either. Could that version be sped up 5× too? Because that would be pretty cool. Its working set might be bigger than my L2 cache, but as it happens it probably isn't.
Some investigation with callgrind says my filter program in C++ is executing 97% of its instructions inside of operator >>. I can identify at least 10 function calls per input byte, and cin.sync_with_stdio(false); doesn't help. This probably means I could get hrnt's C++ program to run substantially faster by parsing input lines more efficiently.
Edit: kcachegrind claims that hrnt's program executes 62% of its instructions (on a small 157000-line input file) extracting doubles from an istream. A substantial part of this is because the istreams library apparently executes about 13 function calls per input byte when trying to parse a double. Insane. Could I be misunderstanding kcachegrind's output?
Anyway, any other suggestions?
Pretty straightforward Caml (27 * 10^6 rows -- 27 sec, C++ by hrnt -- 29 sec)
Here is a C++ solution. I didn't have a lot of data to test it with, however, so I don't know how fast it actually is.
[edit] Thanks to the test data provided by the awk script in this thread, I
managed to clean up and speed up the code a bit. I am not trying to find out the fastest possible version - the intent is to provide a reasonably fast version that isn't as ugly as people seem to think STL solutions can be.
This version should be about twice as fast as the first version (goes through 27 million lines in about 35 seconds). Gcc users, remember to
compile this with -O2.
Interestingly, the original Python solution is by far the cleanest looking (although the C++ example comes close).
How about using Pyrex or Psyco on your original code?
Has anybody tried doing this problem with just awk. Specifically 'mawk'? It should be faster than even Java and C++, according to this blog post: http://anyall.org/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/
EDIT: Just wanted to clarify that the only claim being made in that blog post is that for a certain class of problems that are specifically suited to awk-style processing, the mawk virtual machine can beat 'vanilla' implementations in Java and C++.
Since you asked about Matlab, here's how I did something like what you're asking for. I tried to do it without any for loops, but I do have one because I didn't care to take a long time with it. If you were worried about memory then you could pull data from the stream in chunks with fscanf rather than reading the entire buffer.
Speaking of lower bounds on compute time :
Let's analyze my algo above :
Let N be the N in top-N
Let R be the number of rows in your data set
Let K be the number of distinct keys
What assumptions can we make ?
R * sizeof( row ) > RAM or at least it's big enough that we don't want to load it all, use a hash to group by key, and sort each bin. For the same reason we don't sort the whole stuff.
Kragen likes hashtables, so K * sizeof(per-key state) << RAM, most probably it fits in L2/3 cache
Kragen is not sorting, so K*N << R, i.e. each key has many more than N entries
(note : A << B means A is small relative to B)
If the data has a random distribution, then
after a small number of rows, the majority of rows will be rejected by the per-key minimum condition; the cost is 1 comparison per row.
So the cost per row is 1 hash lookup + 1 comparison + epsilon * (list insertion + (N+1) comparisons for the minimum)
If the scores have a random distribution (say between 0 and 1) and the conditions above hold, both epsilons will be very small.
Experimental proof :
The 27 million rows dataset above produces 5933 insertions into the top-N lists. All other rows are rejected by a simple key lookup and comparison. epsilon = 0.0001
So roughly, the cost is 1 lookup + comparison per row, which takes a few nanoseconds; over 27 million rows that is a fraction of a second of compute.
On current hardware, that is negligible compared to the IO cost and especially the parsing cost.
Isn't this just as simple as
?
Of course, it's hard to tell what would be fastest without testing it against the data. And if this is something you need to run very fast, it might make sense to optimize your database to make the query faster, rather than optimizing the query.
And, of course, if you need the flat file anyway, you might as well use that.
Pick "top 5" would look something like this. Note that there's no sorting. Nor does any list in the top_5 dictionary ever grow beyond 5 elements.
The Pig version would go something like this (untested):
Pig isn't optimized for performance; its goal is to enable processing of multi-terabyte datasets using parallel execution frameworks. It does have a local mode, so you can try it, but I doubt it will beat your script.
That was a nice lunch break challenge, he, he.
Top-N is a well-known database killer. As shown by the post above, there is no way to efficiently express it in common SQL.
As for the various implementations, you have to keep in mind that the slow part here is not the sorting or the top-N, it's the parsing of the text. Have you looked at the source code for glibc's strtod() lately?
For instance, I get, using Python :
It is quite likely that you'll never get very fast timings, no matter what language you use, unless your data is in some format that is a lot faster to parse than text. For instance, loading the test data into postgres takes 70 s, and the majority of that is text parsing, too.
If the N in your topN is small, like 5, a C implementation of my algorithm below would probably be the fastest. If N can be larger, heaps are a much better option.
So, since your data is probably in a database, and your problem is getting at the data, not the actual processing, if you're really in need of a super fast TopN engine, what you should do is write a C module for your database of choice. Since postgres is faster for about anything, I suggest using postgres, plus it isn't difficult to write a C module for it.
Here's my Python code :
I love lunch break challenges. Here's a 1 hour implementation.
OK, when you don't want to do some extremely exotic crap like additions, nothing stops you from using a custom base-10 floating point format whose only implemented operator is comparison, right? lol.
I had some fast-atoi code lying around from a previous project, so I just imported that.
http://www.copypastecode.com/11541/
This C source code takes about 6.6 seconds to parse the 580MB of input text (27 million lines), half of that time is fgets, lol. Then it takes approximately 0.05 seconds to compute the top-n, but I don't know for sure, since the time it takes for the top-n is less than the timer noise.
You'll be the one to test it for correctness though XDDDDDDDDDDD
Interesting huh ?
Well, please grab a coffee and read the source code for strtod -- it's mind-boggling, but needed if you want float -> text -> float to give back the same float you started with.... really...
Parsing integers is a lot faster (not so much in python, though, but in C, yes).
Anyway, putting the data in a Postgres table :
=> 7 s (so it takes 7 s to read the 27M records)
191 s
12 s
(You can use the index to get distinct values of 'key' faster too but it requires some light plpgsql hacking)
Temps : 15,310 ms
Temps : 17,853 ms
Temps : 13,983 ms
Temps : 16,860 ms
Temps : 17,651 ms
Temps : 19,216 ms
As you can see computing the top-n is extremely fast (provided n is small) but creating the (mandatory) index is extremely slow because it involves a full sort.
Your best bet is to use a format that is fast to parse (either binary, or write a custom C aggregate for your database, which would be the best choice IMHO). The runtime in the C program shouldn't be more than 1s if python can do it in 1 s.