How to sort very large files
I have some files that should be sorted according to the id at the beginning of each line. The files are about 2-3 GB.
I tried to read all the data into an ArrayList and sort it, but there is not enough memory to hold it all; it does not work.
Lines look like:
0052304 0000004000000000000000000000000000000041 John Teddy 000023
0022024 0000004000000000000000000000000000000041 George Clan 00013
How can I sort the files?
Comments (10)
That isn't exactly a Java problem. You need to look into an efficient algorithm for sorting data that isn't completely read into memory. A few adaptations to Merge-Sort can achieve this.
Take a look at this:
http://en.wikipedia.org/wiki/Merge_sort
and:
http://en.wikipedia.org/wiki/External_sorting
Basically, the idea here is to break the file into smaller pieces, sort the pieces (either with merge sort or another method), and then use the merge step from merge sort to create the new, sorted file.
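As an illustration of that split-then-merge idea, here is a minimal Java sketch (the file names and chunk size are my own placeholder assumptions, not part of the answer):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class ExternalSort {

    // One sorted run file plus its current (peeked) line.
    private static final class RunReader implements Closeable {
        final BufferedReader reader;
        String current;
        RunReader(Path p) throws IOException {
            reader = Files.newBufferedReader(p, StandardCharsets.UTF_8);
            current = reader.readLine();
        }
        void advance() throws IOException { current = reader.readLine(); }
        @Override public void close() throws IOException { reader.close(); }
    }

    public static void main(String[] args) throws IOException {
        Path input = Paths.get("input.txt");   // placeholder paths
        Path output = Paths.get("sorted.txt");
        int maxLinesPerChunk = 1_000_000;      // tune to the available heap

        // Phase 1: read the file piece by piece, sort each piece in memory,
        // and write it out as a temporary sorted "run" file.
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= maxLinesPerChunk) {
                    runs.add(writeRun(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                runs.add(writeRun(buffer));
            }
        }

        // Phase 2: k-way merge of the sorted runs, exactly the merge step of merge sort.
        PriorityQueue<RunReader> heap =
                new PriorityQueue<>(Comparator.comparing((RunReader r) -> r.current));
        for (Path run : runs) {
            RunReader r = new RunReader(run);
            if (r.current != null) heap.add(r); else r.close();
        }
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            while (!heap.isEmpty()) {
                RunReader r = heap.poll();     // smallest head line across all runs
                out.write(r.current);
                out.newLine();
                r.advance();
                if (r.current != null) heap.add(r); else r.close();
            }
        }
        for (Path run : runs) {
            Files.deleteIfExists(run);
        }
    }

    // Sort one chunk in memory and write it to a temp file. Plain string order
    // works here because the ids are zero-padded to a fixed width.
    private static Path writeRun(List<String> lines) throws IOException {
        Collections.sort(lines);
        Path run = Files.createTempFile("run-", ".txt");
        Files.write(run, lines, StandardCharsets.UTF_8);
        return run;
    }
}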
Since your records are already in flat-file text format, you can pipe them into UNIX sort(1), e.g.

sort -n -t' ' -k1,1 < input > output

It will automatically chunk the data and perform a merge sort using available memory and /tmp. If you need more space than you have memory available, add -T /tmpdir to the command.
It's quite funny that everyone is telling you to download huge C# or Java libraries or implement merge sort yourself when you can use a tool that is available on every platform and has been around for decades.
You need an external merge sort to do that. Here is a Java implementation of it that sorts very large files.
Instead of loading all the data into memory at once, you could read just the keys and an index to where each line starts (and possibly the length as well), as in the sketch below.
This would use about 40 bytes per line.
Once you have sorted this array, you can use RandomAccessFile to read the lines in the order they appear.
Note: since you will be randomly hitting the disk instead of using memory, this could be very slow. A typical disk takes 8 ms to randomly access data, so if you have 10 million lines this will take about a day. (This is the absolute worst case.) In memory it would take about 10 seconds.
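The code sample that originally accompanied this answer is not shown above; the following is a hypothetical sketch of the key-plus-offset idea (class, field, and file names are my own):

import java.io.*;
import java.util.*;

public class KeyIndexSort {

    // Roughly 40 bytes per line: the numeric key plus the line's byte offset.
    static final class KeyIndex implements Comparable<KeyIndex> {
        final long key;    // id parsed from the start of the line
        final long offset; // byte offset of the line in the file
        KeyIndex(long key, long offset) { this.key = key; this.offset = offset; }
        @Override public int compareTo(KeyIndex o) { return Long.compare(key, o.key); }
    }

    public static void main(String[] args) throws IOException {
        File file = new File("input.txt"); // placeholder name

        // Pass 1: record each line's key and starting offset, not the line itself.
        List<KeyIndex> index = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long offset = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                long key = Long.parseLong(line.substring(0, line.indexOf(' ')));
                index.add(new KeyIndex(key, offset));
                offset = raf.getFilePointer();
            }
        }
        Collections.sort(index);

        // Pass 2: seek to each line in key order and write it out.
        try (RandomAccessFile raf = new RandomAccessFile(file, "r");
             PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("sorted.txt")))) {
            for (KeyIndex ki : index) {
                raf.seek(ki.offset);
                out.println(raf.readLine());
            }
        }
    }
}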
You need to perform an external sort. It's kind of the driving idea behind Hadoop/MapReduce, except that it doesn't take a distributed cluster into account and works on a single node.
For better performance, you should use Hadoop/Spark.
Change these lines according to your system. fpath is your one big input file (tested with 20 GB). The shared path is where the execution log is stored. fdir is where the intermediate files will be stored and merged. Change these paths according to your machine. Then run the following program. Your final sorted file will be created with the name op401 in the fdir path. The last line

Runtime.getRuntime().exec("valsort " + fdir + "op" + (treeHeight*100)+1 + " > " + opLog);

checks whether the output is sorted or not. Remove this line if you don't have valsort installed or if the input file was not generated using gensort (http://www.ordinal.com/gensort.html). Also, don't forget to change

int totalLines = 200000000;

to the total number of lines in your file. The thread count (int threadCount = 16) should always be a power of 2 and large enough that (total size * 2 / thread count) worth of data can reside in memory (e.g., for a 20 GB input and 16 threads, about 2.5 GB). Changing the thread count will change the name of the final output file: for 16 it will be op401, for 32 it will be op501, for 8 it will be op301, and so on. Enjoy.
Use the Java library big-sorter, which can be used to sort very large text or binary files.
Here's how your exact problem would be implemented:
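The answer's code block is not reproduced above. Assuming big-sorter's documented fluent API (Sorter.serializer / Serializer.linesUtf8; exact entry points may differ by version), a sketch for this problem might look like:

import java.io.File;
import com.github.davidmoten.bigsorter.Serializer;
import com.github.davidmoten.bigsorter.Sorter;

public class BigSorterExample {
    public static void main(String[] args) {
        // Compare lines by the fixed-width id at the start of each line;
        // the library handles chunking and merging on disk.
        Sorter
            .serializer(Serializer.linesUtf8())
            .comparator((a, b) -> {
                String idA = a.substring(0, a.indexOf(' '));
                String idB = b.substring(0, b.indexOf(' '));
                return idA.compareTo(idB);
            })
            .input(new File("input.txt"))   // placeholder file names
            .output(new File("sorted.txt"))
            .sort();
    }
}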
output:
You can use a SQLite file database: load the data into the database, then let it sort and return the results for you.
Advantages: no need to worry about writing the best sorting algorithm.
Disadvantage: you will need disk space, and processing will be slower.
https://sites.google.com/site/arjunwebworld/Home/programming/sorting-large-data-files
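A rough sketch of that approach, assuming the org.xerial sqlite-jdbc driver is on the classpath (table and file names are illustrative):

import java.io.*;
import java.nio.file.*;
import java.sql.*;

public class SqliteSort {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:lines.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS lines (id TEXT, line TEXT)");
            }
            conn.setAutoCommit(false); // batch inserts for speed

            // Load: one row per line, keyed by the leading id.
            try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"));
                 PreparedStatement ins =
                         conn.prepareStatement("INSERT INTO lines (id, line) VALUES (?, ?)")) {
                String line;
                int count = 0;
                while ((line = in.readLine()) != null) {
                    ins.setString(1, line.substring(0, line.indexOf(' ')));
                    ins.setString(2, line);
                    ins.addBatch();
                    if (++count % 10_000 == 0) ins.executeBatch(); // flush periodically
                }
                ins.executeBatch();
                conn.commit();
            }

            // Let the database do the sorting.
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT line FROM lines ORDER BY id");
                 BufferedWriter out = Files.newBufferedWriter(Paths.get("sorted.txt"))) {
                while (rs.next()) {
                    out.write(rs.getString(1));
                    out.newLine();
                }
            }
        }
    }
}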
What you need to do is chunk the files in via a stream and process them separately. Then you can merge the files together, as they will already be sorted; this is similar to how merge sort works.
The answer to this SO question will be of value: Stream large files
Operating systems come with a powerful file-sorting utility. A simple function that calls a bash script should help; see the sketch below.
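For example, a small sketch that shells out to the OS sort utility from Java, reusing the sort invocation suggested in an earlier answer (file names are placeholders):

import java.io.File;
import java.io.IOException;

public class ShellSort {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Delegate the heavy lifting to the OS sort utility, redirecting
        // stdin and stdout to the input and output files.
        ProcessBuilder pb = new ProcessBuilder("sort", "-n", "-t", " ", "-k1,1");
        pb.redirectInput(new File("input.txt"));
        pb.redirectOutput(new File("sorted.txt"));
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        int exit = pb.start().waitFor();
        if (exit != 0) {
            throw new IOException("sort exited with status " + exit);
        }
    }
}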
I used my own logic and sorted a BIG JSON file.
Full source code is available at https://github.com/sitetester/token-sorter along with a test case. The code is well documented, so it is easy to understand.
It splits the input file into multiple smaller SORTED files (configurable) and then compares the data.
Pasting some comments here...