Calculating token counters over a huge dataset

I need to go over a huge amount of text (> 2 TB, a full Wikipedia dump) and keep two counters for each token seen (each counter is incremented depending on the current event). The only operation that I will need for these counters is incrementing. In a second phase, I should calculate two floats based on these counters and store them.

It should perform the following steps:

  1. Go over huge amounts of text and increase two counters for each word it finds, depending on the current event.
  2. Go over all tokens and, for each of them, compute two additional floats based on these counters.
  3. Allow queries (getting the values for any given token).

Requirements and other details:

  • It must scale up to O(10^8) tokens.
  • The final result needs to be queried very fast!
  • While going over the texts, the only operation will be incrementing the two counters. This is a one-time processing step, so there will be no queries during processing, only value updates.
  • No need for dynamic/updateable schema.
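
For concreteness, here is a minimal in-memory sketch of the two-counter model from the steps above (the whitespace tokenizer, the boolean event and the relative-frequency floats are placeholder assumptions, not part of the actual task; the real token set may well not fit in one dict, which is the whole problem):

from collections import defaultdict

counters = defaultdict(lambda: [0, 0])      # token -> [counter1, counter2]

def process(text, event):
    # Step 1: increment one of the two counters per token, depending on the event.
    for token in text.split():              # placeholder tokenizer
        counters[token][0 if event else 1] += 1

def finalize():
    # Step 2: derive two floats per token from its counters
    # (relative frequencies here, purely as an illustration).
    total1 = sum(c[0] for c in counters.values()) or 1
    total2 = sum(c[1] for c in counters.values()) or 1
    return {t: (c[0] / total1, c[1] / total2) for t, c in counters.items()}

# Step 3: query, e.g. finalize()["wikipedia"]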

I have been trying CouchDB and MongoDB without too good results.

What do you think is the best approach to this problem?

Thank you!

EDIT 1: It has been suggested that I try a Patricia trie and test whether all the keys fit into memory (I suspect they do not). A custom Patricia trie with an extra operator for increasing the values of each key in one step might be a possible solution.

EDIT 2: Clarified what I mean by "huge": > 2 TB of text. More clarifications.

EDIT 3: Unique token estimation. As suggested by Mike Dunlavey, I tried to do a quick estimation of the unique tokens. In the first 830 MB of the dataset, the number of unique tokens grows roughly linearly, reaching 52,134. Unless unique tokens grow more slowly after processing more data (which is likely), there should be O(10^8) unique tokens.

EDIT 4: Java and Python solutions are preferred but any other language is ok too.

EDIT 5: Usually tokens will contain only printable ASCII characters, but they can contain any printable Unicode character. I will try the same process twice: once with lower and upper case left untouched, and once with everything lower-cased.

暖风昔人 2024-10-03 15:23:50

A strategy, rather than a solution;

There's no escaping a read-through of the input data by one process, i.e. I don't see how to parallelise the initial operation unless the file is on a parallel I/O system, and even then I think it might be difficult to tackle a 7z file in parallel.

However, what you could try is to implement a process which reads the input data and writes chunks of it across your file system, preferably onto enough different disks that the processes you are going to start next don't all queue for the same read/write heads.

As soon as the first chunk has been written you start up a process on another core (you have got multicore, haven't you? possibly even a cluster or network of workstations?) to start digesting that chunk. This process writes partial results to file(s).

As soon as the second chunk has been written you start up a process on another core ...

... you get the picture

Once the entire input has been processed you then devise tasks to merge the results from the output of the tasks processing each chunk. You'd do this in some sort of cascade (e.g. if you had 32 chunks and 16 processors you might have each processor merge 2 chunks, then 8 of them merge 2 of the merged chunks, and so on).
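
A minimal sketch of one such merge step, assuming (purely for illustration) that each chunk-digesting process writes its partial results as lines of the form "token count1 count2" and that tokens contain no whitespace:

from collections import Counter

def load_partial(path):
    c1, c2 = Counter(), Counter()
    with open(path) as f:
        for line in f:
            token, n1, n2 = line.split()
            c1[token] += int(n1)
            c2[token] += int(n2)
    return c1, c2

def merge_partials(path_a, path_b, path_out):
    # Counters for the same token simply add up, so merging is plain summation.
    a1, a2 = load_partial(path_a)
    b1, b2 = load_partial(path_b)
    a1.update(b1)                 # Counter.update adds counts
    a2.update(b2)
    with open(path_out, "w") as out:
        for token in sorted(a1):
            out.write("%s %d %d\n" % (token, a1[token], a2[token]))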

My best guess is that you should be fine with flat files for this, not sure that the extra power of a DB is worth the extra cost (in terms of performance and complexity of programming). I suppose you might want to write the final results into a database to support queries.

EDIT: well, if all your queries are of the form 'get me the counters for token XXX' then you could get away with a binary search through a single sorted text file. I'm not suggesting that you should, but it might point you in the direction of a solution. Forgetting for the time being that tokens might start with any character (which is just a matter of the alphabet), you could have 26 files, one for tokens starting with A, one for tokens starting with B, and so on.

Or you could construct an index into the master file with entries for A (offset 0 from start of file) B (offset 12456 from start) and so on.
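
A sketch of building that kind of offset index over a sorted master file, keyed by the first byte of each line (the "token counters..." line layout is assumed, and the offsets are byte offsets):

def build_letter_index(path):
    # One pass over the sorted master file, recording the byte offset
    # at which each new leading byte ('a', 'b', ...) first appears.
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            first = line[:1]
            if first not in index:
                index[first] = offset
            offset += len(line)
    return index

# At query time: seek to index[token.encode()[:1]] and scan (or binary-search) from there.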

Personally I'd play around a bit with the one-sorted-text-file-per-initial-letter approach until I had a working solution, then figure out whether it was fast enough. But I have access to large clusters with oodles of disk and lashings of RAM and your platform might dictate another, possibly more sophisticated, approach.

偏闹i 2024-10-03 15:23:50

As I understood it, you only want to count tokens. The first solution could be just using a hash map in memory. 52-100k tokens (and the average length of words in English is ca. 5.1 characters) + 4 bytes per token for keeping the count is not that much data. You can easily store the map in the memory of a developer machine.

The second solution is to use Apache Lucene for storing new tokens -- unless you have more than about 1M entries, you do not need to partition the index -- and the counter values I would store in a database, e.g. SQLite (because updating a Lucene index is not the best idea).
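
A minimal sketch of the SQLite side of this suggestion, using Python's built-in sqlite3 module (the table layout is my own assumption, the Lucene part is left out, and the ON CONFLICT upsert needs SQLite 3.24 or newer):

import sqlite3

conn = sqlite3.connect("counters.db")
conn.execute("CREATE TABLE IF NOT EXISTS counters "
             "(token TEXT PRIMARY KEY, c1 INTEGER, c2 INTEGER)")

def bump(token, d1, d2):
    # Insert the token if unseen, otherwise add to its two counters.
    conn.execute(
        "INSERT INTO counters (token, c1, c2) VALUES (?, ?, ?) "
        "ON CONFLICT(token) DO UPDATE SET "
        "c1 = c1 + excluded.c1, c2 = c2 + excluded.c2",
        (token, d1, d2))

bump("stack", 1, 0)
conn.commit()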

To speed up the process -- for both solutions -- I would just split your dataset into k*100 datasets, run them separately on different machines (or in parallel), and then merge their results. The resulting counts can be summed without any problems.

Your use case is a classic example in Apache Hadoop tutorials, but I think deploying it would be over-engineering.

夢归不見 2024-10-03 15:23:50

If you have a lot of memory you could just use plain redis for storing the counters (10^8 unique tokens with two counters each would take around 12GB I guess).

If you don't have that much memory you could still use redis, but with a little hashing strategy and vm_enabled to make it fit in memory:

You could have tokens divided by first and second letters (aa, ab, ac... zz) for the hash name, the actual word + counter identifier as the hash key, and the count as the value. It would look like this:

hash ab
- absence_c1 5
- absence_c2 2
- abandon_c1 2
- abandon_c2 10
hash st
- stack_c1 10
- stack_c2 14

But in this approach, as redis can't "incr" on hashes, you would get the previous value, then increment it and set it back, this way (a rough Python sketch using the redis-py client):

import redis

r = redis.Redis()                           # assumes a Redis server on localhost
last = int(r.hget("st", "stack_c1") or 0)   # read the current counter (0 if missing)
r.hset("st", "stack_c1", last + 1)          # write back the incremented value

Using this hash pattern and with vm enabled, redis will keep memory usage low while still being fast enough. I was able to store 2 million tokens, with 15 chars each, using less than 100MB of RAM and almost 4GB of disk.
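
For what it's worth, Redis 2.0.0 and later do provide HINCRBY, which increments a hash field in one atomic call and removes the get/increment/set round trip above:

r.hincrby("st", "stack_c1", 1)   # atomic server-side increment of one hash field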

许一世地老天荒 2024-10-03 15:23:50

High-level solution:

  1. Parse through the input, outputting "[token] +X +Y" lines to 1-of-N output files (Each of these "sharded" output files is small enough to be processed in-memory.)
  2. [For each file] read it into memory, output a sorted file with "[token] [count1] [count2] ..." lines
  3. At query time, do a binary search on the correct file

Details:
Here is Python pseudo-code for step 1)

NUM_SHARDS = 1000  # big enough to make each file fit in memory  
output_files = [open("file" + str(n), "w") for n in xrange(NUM_SHARDS)]
for token in input_stream:
   shard_id = hash(token) % NUM_SHARDS
   output_files[shard_id].write(token + " +0 +1\n")
   # TODO: output the correct +X and +Y as needed

Here is Python pseudo-code for step 2)

input_files = [open("file" + str(n)) for n in xrange(NUM_SHARDS)]
for file in input_files:
   counts = {}   # Key: token   Value: { "count1": 0, "count2": 0 }

   # read the file, and populate 'counts'
   for line in file:
      (token, count1, count2) = line.split(" ")
      # make sure we have a value for this token
      counts.setdefault(token, { "count1": 0, "count2": 0 })
      counts[token]["count1"] += int(count1)
      counts[token]["count2"] += int(count2)
      # TODO: compute those floats, and stuff those inside 'counts' also

   # now write 'counts' out to a file (in sorted order)
   output_file = open(file.name + ".index", "w")
   for token, token_counts in sorted(counts.items()):
      output_file.write(token + " " + str(token_counts["count1"]) + " " + str(token_counts["count2"]) + "\n")
      # TODO: also write out those floats in the same line

Here is some Python code for step 3):

# assume 'token' contains the token you want to find
shard_id = hash(token) % NUM_SHARDS
filename = "file" + str(shard_id) + ".index"
binary_search(token, open(filename), 0, os.path.getsize(filename))

# print out the line in 'file' whose first token is 'token'
# begin/end always point to the start of a line
def binary_search(token, file, begin, end):
    # If we're close, just do brute force
    if end - begin < 10000:
            file.seek(begin)
            while file.tell() < end:
                    line = file.readline()
                    cur_token = line.strip().split(" ")[0]
                    if cur_token == token:
                            print line
                            return True
            return False  # not found

    # If we're not close, pivot based on a line near the middle
    file.seek((begin + end) / 2)
    partial_line = file.readline()  # ignore the first fractional line
    line = file.readline()

    cur_token = line.strip().split(" ")[0]
    if cur_token == token:
            print line
            return True
    elif cur_token < token:
            return binary_search(token, file, file.tell(), end)
    else:  # cur_token > token
            return binary_search(token, file, begin, file.tell() - len(line))

护你周全 2024-10-03 15:23:50

OK, if MongoDB and CouchDB don't work for you, then you basically have one problem: not enough power.

Let's look at the laundry list:

It must scale up to O(10^8) tokens.

How much RAM do you have? You're talking about hundreds of millions of tokens and you're talking about streaming out a 7zip file. If you want to issue "increments" quickly, you need to be able to keep the entire data structure in memory or the whole thing will go very slowly.

The final result needs to be queried very fast!

How fast? Microseconds, milliseconds, hundreds of milliseconds? If you want to query into 500M records on a machine with 8GB of RAM, you're pretty much hooped. The data just won't fit; it doesn't matter what DB you're using.

Dataset > 2 TB

OK, let's assume that your computer can average about 50MB/second of sustained throughput and that your proc can actually decompress data at that pace. At that pace you're talking about 11+ hours of processing time just to stream the data: 2 TB ≈ 2,000,000 MB, which at 50 MB/s is 40,000 seconds ≈ 11.1 hours (you wanted this done in a weekend?).

50MB/s of throughput for 11 hours is not small potatoes, that's a real drive. And if you try to write anything to the disk while that's happening (or the OS swaps), then that is going to degrade fast.

Looking at it from a DB perspective, MongoDB can handle both the front-end updating and the back-end querying. But it needs to flush to disk every minute or so, and that's going to significantly extend your 11-hour run time.

That total run-time is just going to get worse and worse unless you can handle the entire DB in memory and the entire stream in memory.

My point...

is pretty simple, you need more power.

If you're not running this operation with 24GB+ of RAM, then everything you do is going to feel slow. If you don't have 24GB+ of RAM, then your final dataset is not going to be "lightning-quick", at best it will be "200 ms-quick". You just can't index 500M rows and expect to find an entry quickly unless you can keep the index in RAM.

If you're not running this operation with awesome HDDs, then the operation is going to seem slow. I mean, you're talking about hours and hours of high-throughput sustained reads (and probably writes).

I know that you want help, I know that you've put a bounty on this question, but it's really hard to fix the following problem:

I have been trying CouchDB and MongoDB without too good results.

when it sounds like you haven't really gotten together the right gear to solve the problem.

满身野味 2024-10-03 15:23:50

Must you use a DB, rather than reading a text file?

A simple C-type compiled language can run a simple parser in a fraction of the time it takes to read the file, so it should be basically "I/O bound".
It would be a program similar to unix wc, word-count.

Sounds like the math is trivial and should not even be noticeable.

EDIT: OK, I didn't understand that you wanted to build a dictionary of the unique tokens, and count each one. In that case, a trie or hash-based dictionary should suffice. The storage size of that will depend on the typical length of tokens and how many different ones there are. This could be similar to the unix sort | uniq idiom.
