What language could I use for fast execution of this database summarization task?
So I wrote a Python program to handle a little data processing
task.
Here's a very brief specification in a made-up language of the computation I want:
parse "%s %lf %s" aa bb cc | group_by aa | quickselect --key=bb 0:5 | \
flatten | format "%s %lf %s" aa bb cc
That is, for each line, parse out a word, a floating-point number, and another word. Think of them as a player ID, a score, and a date. I want the top five scores and dates for each player. The data size is not trivial, but not huge; about 630 megabytes.
I want to know what real, executable language I should have written it in to
get it to be similarly short (as the Python below) but much faster.
#!/usr/bin/python
# -*- coding: utf-8; -*-
import sys
top_5 = {}
for line in sys.stdin:
    aa, bb, cc = line.split()
    # We want the top 5 for each distinct value of aa. There are
    # hundreds of thousands of values of aa.
    bb = float(bb)
    if aa not in top_5: top_5[aa] = []
    current = top_5[aa]
    current.append((bb, cc))
    # Every once in a while, we drop the values that are not in
    # the top 5, to keep our memory footprint down, because some
    # values of aa have thousands of (bb, cc) pairs.
    if len(current) > 10:
        current.sort()
        current[:-5] = []
for aa in top_5:
    current = top_5[aa]
    current.sort()
    for bb, cc in current[-5:]:
        print aa, bb, cc
Here’s some sample input data:
3 1.5 a
3 1.6 b
3 0.8 c
3 0.9 d
4 1.2 q
3 1.5 e
3 1.8 f
3 1.9 g
Here’s the output I get from it:
3 1.5 a
3 1.5 e
3 1.6 b
3 1.8 f
3 1.9 g
4 1.2 q
There are seven values for 3, and so we drop the c and d values because their bb value puts them out of the top 5. Because 4 has only one value, its “top 5” consists of just that one value.
This runs faster than doing the same queries in MySQL (at least, the
way we’ve found to do the queries) but I’m pretty sure it's spending
most of its time in the Python bytecode interpreter. I think that in
another language, I could probably get it to process hundreds of
thousands of rows per second instead of per minute. So I’d like to
write it in a language that has a faster implementation.
But I’m not sure what language to choose.
I haven’t been able to figure out how to express this as a single query in SQL, and
actually I’m really unimpressed with MySQL’s ability even to merely select * from foo into outfile 'bar';
the input data.
C is an obvious choice, but things like line.split(), sorting a list of 2-tuples, and making a hash table require writing some code that’s not in the standard library, so I would end up with 100 lines of code or more instead of 14.
C++ seems like it might be a better choice (it has strings, maps,
pairs, and vectors in the standard library) but it seems like the code
would be a lot messier with STL.
OCaml would be fine, but does it have an equivalent of line.split(), and will I be sad about the performance of its map?
Common Lisp might work?
Is there some equivalent of Matlab for database computation like this
that lets me push the loops down into fast code? Has anybody tried Pig?
(Edit: responded to davethegr8's comment by providing some sample input and output data, and fixed a bug in the Python program!)
(Additional edit: Wow, this comment thread is really excellent so far. Thanks, everybody!)
Edit:
There was an eerily similar question asked on sbcl-devel in 2007 (thanks, Rainer!), and here's an awk
script from Will Hartung for producing some test data (although it doesn't have the Zipfian distribution of the real data):
BEGIN {
for (i = 0; i < 27000000; i++) {
v = rand();
k = int(rand() * 100);
print k " " v " " i;
}
exit;
}
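The awk script above draws its keys uniformly; as a rough illustration of a heavier-tailed generator, here is a Python 3 sketch. The key count and the Pareto shape parameter are guesses for illustration, not values taken from the real data:

#!/usr/bin/env python3
# Illustrative only: Zipf-ish (heavy-tailed) player IDs, uniform scores.
# NUM_KEYS and the Pareto shape parameter are guesses, not real-data values.
import random

NUM_ROWS = 27000000
NUM_KEYS = 300000   # "hundreds of thousands of values of aa"

for i in range(NUM_ROWS):
    # paretovariate(1.0) is heavy-tailed, so small key ids occur far more often
    k = int(random.paretovariate(1.0)) % NUM_KEYS
    print(k, random.random(), i)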
I have a hard time believing that any script without any prior knowledge of the data (unlike MySql which has such info pre-loaded), would be faster than a SQL approach.
Aside from the time spent parsing the input, the script needs to "keep" sorting the order by array etc...
The following is a first guess at what should work decently fast in SQL, assuming an index (*) on the table's aa, bb, cc columns, in that order. (A possible alternative would be an "aa, bb DESC, cc" index.)
(*) This index could be clustered or not, without affecting the following query. The choice of clustering, and of whether a separate "aa, bb, cc" index is needed, depends on the use case, the size of the rows in the table, etc.
The idea is to get a count of how many records, within a given aa value, are smaller than self. There is a small trick however: we need to use a LEFT OUTER join, lest we discard the record with the biggest bb value or the last one (which may happen to be one of the top 5). As a result of the left join, the COUNT(*) value counts 1, 1, 2, 3, 4 etc., and the HAVING test therefore is "<5" to effectively pick the top 5.
To emulate the sample output of the OP, the ORDER BY uses DESC on the COUNT(), which could be removed to get a more traditional top 5 type of listing. Also, the COUNT() in the select list can be removed if so desired, this doesn't impact the logic of the query and the ability to properly sort.
Also note that this query is deterministic in terms of dealing with ties, i.e. when a given set of records has the same value for bb (within an aa group); I think the Python program may provide slightly different outputs when the order of the input data is changed, because of its occasional truncating of the sorting dictionary.
Real solution: A SQL-based procedural approach
The self-join approach described above demonstrates how declarative statements can be used to express the OP's requirement. However this approach is naive in the sense that its performance is roughly bound to the sum of the squares of the record counts within each aa 'category' (not O(n^2), but roughly O((n/a)^2), where a is the number of different values of the aa column). In other words it performs well with data such that, on average, the number of records associated with a given aa value doesn't exceed a few dozen. If the data is such that the aa column is not selective, the following approach is much -much!- better suited. It leverages SQL's efficient sorting framework, while implementing a simple algorithm that would be hard to express declaratively. This approach could be further improved for datasets where each/most aa 'categories' have a particularly large number of records, by introducing a simple binary search for the next aa value, looking ahead (and sometimes back...) in the cursor. For cases where the number of aa 'categories' relative to the overall row count in tblAbc is low, see yet another approach, after this next one.
Alternative to the above for cases when aa is very unselective. In other words, when we have relatively few aa 'categories'. The idea is to go through the list of distinct categories and to run a "LIMIT" (MySql) "TOP" (MSSQL) query for each of these values.
For reference purposes, the following ran in 63 seconds for tblAbc of 61 Million records divided in 45 aa values, on MSSQL 8.0, on a relatively old/weak host.
On the question of needing an index or not. (cf OP's remark)
When merely running a "SELECT * FROM myTable", a table scan is effectively the fastest appraoch, no need to bother with indexes. However, the main reason why SQL is typically better suited for this kind of things (aside from being the repository where the data has been accumulating in the first place, whereas any external solution needs to account for the time to export the relevant data), is that it can rely on indexes to avoid scanning. Many general purpose languages are far better suited to handle raw processing, but they are fighting an unfair battle with SQL because they need to rebuilt any prior knowledge of the data which SQL has gathered in the course of its data collection / import phase. Since sorting is a typically a time and sometimes space consuming task, SQL and its relatively slower processing power often ends up ahead of alternative solutions.
Also, even without pre-built indexes, modern query optimizers may decide on a plan that involves the creation of a temporary index. And, because sorting is an intrinsic part of DDMS, the SQL servers are generally efficient in that area.
So... Is SQL better?
This said, if we are trying to compare SQL and other languages for pure ETL jobs, i.e. for dealing with heaps (unindexed tables) as the input to perform various transformations and filtering, multi-threadable utilities written in, say, C, leveraging efficient sorting libraries, would likely be faster. The determining question in deciding on a SQL vs. non-SQL approach is where the data is located and where it should eventually reside. If we merely need to convert a file to be supplied down "the chain", external programs are better suited. If we have or need the data in a SQL server, there are only rare cases that make it worthwhile to export and process it externally.
You could use smarter data structures and still use python.
I've run your reference implementation and my Python implementation on my machine, and even compared the output to be sure of the results.
This is yours:
This is mine:
And this is the source:
Update: Know your limits.
I've also timed a noop code, to know the fastest possible python solution with code similar to the original:
And the noop.py itself:
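abbot's actual source accompanied the original answer but did not survive the copy here. As a hedged illustration of the "smarter data structures" idea (not abbot's code), a bounded min-heap per key, using Python 3 and the standard heapq module, replaces the original program's periodic sort-and-truncate:

#!/usr/bin/env python3
# Sketch only: keep a size-5 min-heap of (score, date) per player, so each
# row costs one dict lookup plus at most one heap operation.
import heapq
import sys

top_5 = {}
for line in sys.stdin:
    aa, bb, cc = line.split()
    bb = float(bb)
    heap = top_5.setdefault(aa, [])
    if len(heap) < 5:
        heapq.heappush(heap, (bb, cc))
    elif bb > heap[0][0]:
        # New score beats the current 5th-best for this key; replace it.
        heapq.heapreplace(heap, (bb, cc))

for aa, heap in top_5.items():
    for bb, cc in sorted(heap):
        print(aa, bb, cc)

Here heap[0] is always the smallest of the five kept scores, so most rows are rejected with a single comparison.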
This took 45.7s on my machine with 27M rows of data that looked like this:
Your script took 1m42 on this data, and the C++ example took 1m46 (g++ t.cpp -o t to compile it; I don't know anything about C++).
Java 6, not that it matters really. Output isn't perfect, but it's easy to fix.
AWK script to fake the data:
This is a sketch in Common Lisp.
Note that for long files there is a penalty for using READ-LINE, because it conses a fresh string for each line. Use one of the derivatives of READ-LINE that are floating around and that use a line buffer instead. You might also check whether you want the hash table to be case sensitive or not.
second version
Splitting the string is no longer needed, because we do it here. It is low level code, in the hope that some speed gains will be possible. It checks for one or more spaces as field delimiter and also tabs.
Above function returns two values: the key and a list of the rest.
some test data
; (create-test-data)
third version, LispWorks
Uses some SPLIT-STRING and PARSE-FLOAT functions, otherwise generic CL.
Here is one more OCaml version - targeted for speed - with custom parser on Streams. Too long, but parts of the parser are reusable. Thanks peufeu for triggering competition :)
Speed :
Compile with :
Code :
Of all the programs in this thread that I've tested so far, the OCaml version is the fastest and also among the shortest. (Line-of-code-based measurements are a little fuzzy, but it's not clearly longer than the Python version or the C or C++ versions, and it is clearly faster.)
Here are the timings for the different versions so far, running on a 27-million-row 630-megabyte input data file. I'm on Ubuntu Intrepid Ibex on a dual-core 1.6GHz Celeron, running a 32-bit version of the OS (the Ethernet driver was broken in the 64-bit version). I ran each program five times and report the range of times those five tries took. I'm using Python 2.5.2, OpenJDK 1.6.0.0, OCaml 3.10.2, GCC 4.3.2, SBCL 1.0.8.debian, and Octave 3.0.1.
Pig (apt-get install pig): 7 lines of code.
sort -n pipeline: 90 + 5.5 seconds (±3, ±0.1), but gives the wrong answer because of a deficiency in GNU sort, in 22 lines of code (including one line of shell).
noop.py, which intentionally gives incorrect results to establish a lower bound: not yet tested, 15 lines of code.
I suspect abbot's version comes out relatively worse for me than for them because the real dataset has a highly nonuniform distribution: as I said, some aa values (“players”) have thousands of lines, while others only have one.
About Psyco: I applied Psyco to my original code (and abbot's version) by putting it in a main function, which by itself cut the time down to about 140 seconds, and calling psyco.full() before calling main(). This added about four lines of code.
I can almost solve the problem using GNU sort, as follows:
Here top5_sorted_c is this short C program:
I first tried writing that program in C++ as follows, and I got runtimes which were substantially slower, at 33.6±2.3 seconds instead of 5.5±0.1 seconds:
I did say almost. The problem is that sort -n does okay for most of the data, but it fails when it's trying to compare 0.33 with 3.78168e-05. So to get this kind of performance and actually solve the problem, I need a better sort.
Anyway, I kind of feel like I'm whining, but the sort-and-filter approach is about 5× faster than the Python program, while the elegant STL program from hrnt is actually a little slower; there seems to be some kind of gross inefficiency in <iostream>. I don't know where the other 83% of the runtime is going in that little C++ version of the filter, but it isn't going anywhere useful, which makes me suspect I don't know where it's going in hrnt's std::map version either. Could that version be sped up 5× too? Because that would be pretty cool. Its working set might be bigger than my L2 cache, but as it happens it probably isn't.
Some investigation with callgrind says my filter program in C++ is executing 97% of its instructions inside of operator >>. I can identify at least 10 function calls per input byte, and cin.sync_with_stdio(false); doesn't help. This probably means I could get hrnt's C++ program to run substantially faster by parsing input lines more efficiently.
Edit: kcachegrind claims that hrnt's program executes 62% of its instructions (on a small 157000-line input file) extracting doubles from an istream. A substantial part of this is because the istreams library apparently executes about 13 function calls per input byte when trying to parse a double. Insane. Could I be misunderstanding kcachegrind's output?
Anyway, any other suggestions?
Pretty straightforward Caml (27 * 10^6 rows -- 27 sec, C++ by hrnt -- 29 sec)
Here is a C++ solution. I didn't have a lot of data to test it with, however, so I don't know how fast it actually is.
[edit] Thanks to the test data provided by the awk script in this thread, I
managed to clean up and speed up the code a bit. I am not trying to find out the fastest possible version - the intent is to provide a reasonably fast version that isn't as ugly as people seem to think STL solutions can be.
This version should be about twice as fast as the first version (goes through 27 million lines in about 35 seconds). Gcc users, remember to
compile this with -O2.
Interestingly, the original Python solution is by far the cleanest looking (although the C++ example comes close).
How about using Pyrex or Psyco on your original code?
Has anybody tried doing this problem with just awk. Specifically 'mawk'? It should be faster than even Java and C++, according to this blog post: http://anyall.org/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/
EDIT: Just wanted to clarify that the only claim being made in that blog post is that for a certain class of problems that are specifically suited to awk-style processing, the mawk virtual machine can beat 'vanilla' implementations in Java and C++.
Since you asked about Matlab, here's how I did something like what you're asking for. I tried to do it without any for loops, but I do have one because I didn't care to take a long time with it. If you were worried about memory then you could pull data from the stream in chunks with fscanf rather than reading the entire buffer.
Speaking of lower bounds on compute time :
Let's analyze my algo above :
Let N be the N in top-N
Let R be the number of rows in your data set
Let K be the number of distinct keys
What assumptions can we make ?
R * sizeof( row ) > RAM or at least it's big enough that we don't want to load it all, use a hash to group by key, and sort each bin. For the same reason we don't sort the whole stuff.
Kragen likes hashtables, so K * sizeof(per-key state) << RAM, most probably it fits in L2/3 cache
Kragen is not sorting, so K*N << R, i.e. each key has many more than N entries
(note : A << B means A is small relative to B)
If the data has a random distribution, then
after a small number of rows, the majority of rows will be rejected by the per-key minimum condition; the cost is 1 comparison per row.
So the cost per row is 1 hash lookup + 1 comparison + epsilon * (list insertion + (N+1) comparisons for the minimum)
If the scores have a random distribution (say between 0 and 1) and the conditions above hold, both epsilons will be very small.
Experimental proof :
The 27 million rows dataset above produces 5933 insertions into the top-N lists. All other rows are rejected by a simple key lookup and comparison. epsilon = 0.0001
So roughly, the cost is 1 lookup + comparison per row, which takes a few nanoseconds; over 27 million rows that is a fraction of a second of compute.
On current hardware, that is negligible compared to the IO cost and especially the parsing cost.
Isn't this just as simple as
?
Of course, it's hard to tell what would be fastest without testing it against the data. And if this is something you need to run very fast, it might make sense to optimize your database to make the query faster, rather than optimizing the query.
And, of course, if you need the flat file anyway, you might as well use that.
Pick "top 5" would look something like this. Note that there's no sorting. Nor does any list in the top_5 dictionary ever grow beyond 5 elements.
The Pig version would go something like this (untested):
Pig isn't optimized for performance; its goal is to enable processing of multi-terabyte datasets using parallel execution frameworks. It does have a local mode, so you can try it, but I doubt it will beat your script.
That was a nice lunch break challenge, he, he.
Top-N is a well-known database killer. As shown by the post above, there is no way to efficiently express it in common SQL.
As for the various implementations, you have to keep in mind that the slow part here is not the sorting or the top-N, it's the parsing of the text. Have you looked at the source code for glibc's strtod() lately?
For instance, I get, using Python :
It is quite likely that you'll never get very fast timings, no matter what language you use, unless your data is in some format that is a lot faster to parse than text. For instance, loading the test data into postgres takes 70 s, and the majority of that is text parsing, too.
If the N in your topN is small, like 5, a C implementation of my algorithm below would probably be the fastest. If N can be larger, heaps are a much better option.
So, since your data is probably in a database, and your problem is getting at the data, not the actual processing, if you're really in need of a super fast TopN engine, what you should do is write a C module for your database of choice. Since postgres is faster for about anything, I suggest using postgres, plus it isn't difficult to write a C module for it.
Here's my Python code :
I love lunch break challenges. Here's a 1 hour implementation.
OK, when you don't want to do some extremely exotic crap like additions, nothing stops you from using a custom base-10 floating point format whose only implemented operator is comparison, right? lol.
I had some fast-atoi code lying around from a previous project, so I just imported that.
http://www.copypastecode.com/11541/
This C source code takes about 6.6 seconds to parse the 580MB of input text (27 million lines), half of that time is fgets, lol. Then it takes approximately 0.05 seconds to compute the top-n, but I don't know for sure, since the time it takes for the top-n is less than the timer noise.
You'll be the one to test it for correctness though XDDDDDDDDDDD
Interesting huh ?
Well, please grab a coffee and read the source code for strtod -- it's mind-boggling, but needed if you want float -> text -> float to give back the same float you started with.... really...
Parsing integers is a lot faster (not so much in python, though, but in C, yes).
Anyway, putting the data in a Postgres table :
=> 7 s (so it takes 7 s to read the 27M records)
191 s
12 s
(You can use the index to get distinct values of 'key' faster too but it requires some light plpgsql hacking)
Temps : 15,310 ms
Temps : 17,853 ms
Temps : 13,983 ms
Temps : 16,860 ms
Temps : 17,651 ms
Temps : 19,216 ms
As you can see computing the top-n is extremely fast (provided n is small) but creating the (mandatory) index is extremely slow because it involves a full sort.
Your best bet is to use a format that is fast to parse (either binary, or write a custom C aggregate for your database, which would be the best choice IMHO). The runtime in the C program shouldn't be more than 1s if python can do it in 1 s.