File-based merge sort of large datasets in Java

Posted 2024-11-14 17:47:20

given large datasets that don't fit in memory, is there any library or api to perform sort in Java?
the implementation would possibly be similar to linux utility sort.

Comments (2)

多彩岁月 2024-11-21 17:47:20

Java provides a general-purpose sorting routine which can be used as part of the larger solution to your problem. A common approach to sort data that's too large to all fit in memory is this:

1) Read as much data as will fit into main memory, let's say it's 1 GB

2) Sort that 1 GB in memory (here's where you'd use Java's built-in sort from the Collections framework)

3) Write that sorted 1 GB to disk as "chunk-1"

4) Repeat steps 1-3 until you've gone through all the data, saving each data chunk in a separate file. So if your original data was 9 GB, you will now have 9 sorted chunks of data labeled "chunk-1" through "chunk-9"

5) You now just need a final merge sort to merge the 9 sorted chunks into a single fully sorted data set. The merge sort will work very efficiently against these pre-sorted chunks. It will essentially open 9 file readers (one for each chunk), plus one file writer (for output). It then compares the first data element in each read file and selects the smallest value, which is written to the output file. The reader from which that selected value came advances to its next data element, and the 9-way comparison process to find the smallest value is repeated, again writing the answer to the output file. This process repeats until all data has been read from all the chunk files.

6) Once step 5 has finished reading all the data you are done -- your output file now contains a fully sorted data set

With this approach you could easily write a generic "megasort" utility of your own that takes a filename and maxMemory parameter and efficiently sorts the file by using temp files. I'd bet you could find at least a few implementations out there for this, but if not you can just roll your own as described above.
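For illustration, here is a minimal sketch of such a utility, under a few assumptions that are not in the answer above: the input is a plain text file sorted line by line using String's natural ordering, the memory budget is expressed as a maximum number of lines per chunk (maxLines stands in for the maxMemory parameter), and the k-way comparison of step 5 is done with a PriorityQueue keyed on each reader's current line rather than a literal 9-way scan. The class name ExternalSort and all helper names are hypothetical.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

/** Sketch of an external sort: split into sorted chunk files, then k-way merge them. */
public class ExternalSort {

    /** Phase 1 (steps 1-4): read up to maxLines at a time, sort each batch, write it as a temp chunk file. */
    static List<Path> splitIntoSortedChunks(Path input, int maxLines) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(maxLines);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == maxLines) {
                    chunks.add(writeChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeChunk(buffer));
            }
        }
        return chunks;
    }

    private static Path writeChunk(List<String> buffer) throws IOException {
        Collections.sort(buffer);                        // Java's built-in sort from the Collections framework
        Path chunk = Files.createTempFile("chunk-", ".txt");
        Files.write(chunk, buffer, StandardCharsets.UTF_8);
        return chunk;
    }

    /** Phase 2 (steps 5-6): open one reader per chunk and merge via a priority queue keyed on each current line. */
    static void mergeChunks(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<Map.Entry<String, BufferedReader>> heap =
                new PriorityQueue<>(Map.Entry.<String, BufferedReader>comparingByKey());
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path chunk : chunks) {
                BufferedReader r = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
                readers.add(r);
                String first = r.readLine();
                if (first != null) {
                    heap.add(new AbstractMap.SimpleEntry<>(first, r));
                }
            }
            while (!heap.isEmpty()) {
                Map.Entry<String, BufferedReader> smallest = heap.poll();
                out.write(smallest.getKey());             // emit the smallest current value
                out.newLine();
                String next = smallest.getValue().readLine(); // advance the reader that supplied it
                if (next != null) {
                    heap.add(new AbstractMap.SimpleEntry<>(next, smallest.getValue()));
                }
            }
        } finally {
            for (BufferedReader r : readers) {
                r.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);
        Path output = Paths.get(args[1]);
        int maxLines = 1_000_000;                         // stand-in for the "maxMemory" budget
        List<Path> chunks = splitIntoSortedChunks(input, maxLines);
        mergeChunks(chunks, output);
        for (Path chunk : chunks) {
            Files.deleteIfExists(chunk);                  // clean up temp files
        }
    }
}
```

The priority queue keeps each merge step at O(log k) per output line instead of a linear scan over all open readers, which only matters once the number of chunks grows well beyond a handful, but it keeps the code short either way.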

飘过的浮云 2024-11-21 17:47:20

The most common way to handle large datasets is in memory (you can buy a server with 1 TB these days) or in a database.

If you are not going to use a database (or buy more memory) you can write it yourself fairly easily.

There are libraries that perform Map-Reduce-style operations and may help, but they may add more complexity than they save.
