How to sort very large files

Posted 2024-12-12 00:26:58


I have some files that should be sorted according to the id at the beginning of each line. The files are about 2-3 GB.

I tried to read all the data into an ArrayList and sort it, but there is not enough memory to hold it all, so that does not work.

Lines look like this:

0052304 0000004000000000000000000000000000000041 John Teddy 000023
0022024 0000004000000000000000000000000000000041 George Clan 00013

How can I sort the files?


Answers (10)

慕烟庭风 2024-12-19 00:26:58


That isn't exactly a Java problem. You need an efficient algorithm for sorting data that cannot be read into memory all at once; a few adaptations to merge sort can achieve this.

Take a look at this:
http://en.wikipedia.org/wiki/Merge_sort

and:
http://en.wikipedia.org/wiki/External_sorting

Basically, the idea is to break the file into smaller pieces, sort each piece (with merge sort or another method), and then use the merge step from merge sort to create the new, sorted file.
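
For illustration, here is a minimal sketch of that idea in Java, not a definitive implementation: the chunk size and file names are assumptions, and since each line begins with a fixed-width id, plain lexicographic order sorts by id.

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    // Minimal external merge sort sketch: split into sorted chunks, then k-way merge.
    public class ExternalSort {
        static final int CHUNK_LINES = 1_000_000; // tune so one chunk fits in the heap

        static class Head { // the current head line of one chunk file
            final String line; final BufferedReader reader;
            Head(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
        }

        public static void main(String[] args) throws IOException {
            // Phase 1: read the big file chunk by chunk, sort each chunk in memory, spill to a temp file
            List<Path> chunks = new ArrayList<>();
            try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"))) {
                List<String> buf = new ArrayList<>();
                for (String line = in.readLine(); line != null; line = in.readLine()) {
                    buf.add(line);
                    if (buf.size() >= CHUNK_LINES) chunks.add(spill(buf));
                }
                if (!buf.isEmpty()) chunks.add(spill(buf));
            }
            // Phase 2: k-way merge -- a priority queue always yields the smallest head line
            PriorityQueue<Head> pq = new PriorityQueue<>(Comparator.comparing((Head h) -> h.line));
            for (Path p : chunks) {
                BufferedReader r = Files.newBufferedReader(p);
                String first = r.readLine();
                if (first != null) pq.add(new Head(first, r));
            }
            try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("sorted.txt")))) {
                while (!pq.isEmpty()) {
                    Head h = pq.poll();
                    out.println(h.line);
                    String next = h.reader.readLine();
                    if (next != null) pq.add(new Head(next, h.reader));
                    else h.reader.close();
                }
            }
        }

        static Path spill(List<String> buf) throws IOException {
            Collections.sort(buf); // fixed-width id prefix => lexicographic order == id order
            Path tmp = Files.createTempFile("chunk", ".txt");
            Files.write(tmp, buf);
            buf.clear();
            return tmp;
        }
    }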

奶气 2024-12-19 00:26:58


Since your records are already in flat-file text format, you can pipe them into UNIX sort(1), e.g. sort -n -t' ' -k1,1 < input > output. It will automatically chunk the data and perform a merge sort using available memory and /tmp. If you need more space than available memory, add -T /tmpdir to the command.

It's quite funny that everyone is telling you to download huge C# or Java libraries or implement merge-sort yourself when you can use a tool that is available on every platform and has been around for decades.
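
If the sort has to be driven from Java, one option is simply to shell out to the same utility. A minimal sketch, assuming the input file is input.txt and sort is on the PATH:

    import java.io.File;
    import java.io.IOException;

    // Sketch: invoke the platform's sort(1) from Java and redirect its output to a file.
    public class ShellOutSort {
        public static void main(String[] args) throws IOException, InterruptedException {
            Process p = new ProcessBuilder(
                    "sort", "-n", "-t", " ", "-k1,1", "-T", "/tmp", "input.txt")
                    .redirectOutput(new File("output.txt"))
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            int rc = p.waitFor();
            if (rc != 0) throw new IOException("sort exited with code " + rc);
        }
    }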

徒留西风 2024-12-19 00:26:58


You need an external merge sort to do that. Here is a Java implementation of it that sorts very large files.

征﹌骨岁月お 2024-12-19 00:26:58


Instead of loading all the data into memory at once, you could read just the keys and an index to where each line starts (and possibly the length as well), e.g.

class Line {
   int key, length;
   long start;
}

This would use about 40 bytes per line.

Once you have sorted this array, you can use RandomAccessFile to read the lines in the order they appear.

Note: since you will be randomly hitting the disk rather than using memory, this could be very slow. A typical disk takes about 8 ms per random access, so 10 million lines would take roughly a day (that is the absolute worst case); in memory it would take about 10 seconds.
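
A minimal, untested sketch of that index-then-seek approach, assuming the id is a run of ASCII digits terminated by a space and small enough for a long:

    import java.io.*;
    import java.util.*;

    // Sketch: one pass to index {key, offset}, sort the index, then seek to each line in key order.
    public class IndexSort {
        public static void main(String[] args) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile("input.txt", "r")) {
                List<long[]> index = new ArrayList<>(); // each entry: {key, byte offset of line start}
                long offset = 0;
                String line;
                while ((line = raf.readLine()) != null) {
                    long key = Long.parseLong(line.substring(0, line.indexOf(' ')));
                    index.add(new long[] {key, offset});
                    offset = raf.getFilePointer();
                }
                index.sort(Comparator.comparingLong(e -> e[0]));
                try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("output.txt")))) {
                    for (long[] e : index) {
                        raf.seek(e[1]); // random access -- this is the slow part on spinning disks
                        out.println(raf.readLine());
                    }
                }
            }
        }
    }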

那小子欠揍 2024-12-19 00:26:58


You need to perform an external sort. It's kind of the driving idea behind Hadoop/MapReduce, except that it doesn't take a distributed cluster into account and works on a single node.

For better performance, you should use Hadoop/Spark.

Change these lines according to your system. fPath is your one big input file (tested with 20 GB). shared is the path where the execution log is stored. fdir is where the intermediate files are stored and merged. Adjust these paths for your machine.

    public static final String fdir = "/tmp/";
    public static final String shared = "/exports/home/schatterjee/cs553-pa2a/";
    public static final String fPath = "/input/data-20GB.in";
    public static final String opLog = shared + "Mysort20GB.log";

Then run the following program. The final sorted file will be created with the name op401 in the fdir path. The last line, Runtime.getRuntime().exec(new String[] {"bash", "-c", "valsort " + fdir + "op" + (treeHeight * 100 + 1) + " > " + opLog});, checks whether the output is sorted. Remove it if you don't have valsort installed or if the input file was not generated with gensort (http://www.ordinal.com/gensort.html).

Also, don't forget to change int totalLines = 200000000; to the total number of lines in your file. The thread count (int threadCount = 16) should always be a power of 2 and large enough that (total size * 2 / thread count) of data can reside in memory; for a 20 GB file and 16 threads, for example, each chunk is about 1.25 GB and needs roughly twice that while it is being sorted. Changing the thread count changes the name of the final output file: for 16 it will be op401, for 32 it will be op501, for 8 it will be op301, and so on.

Enjoy.

    import java.io.*;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.stream.Stream;


    class SplitFile extends Thread {
        String fileName;
        int startLine, endLine;

        SplitFile(String fileName, int startLine, int endLine) {
            this.fileName = fileName;
            this.startLine = startLine;
            this.endLine = endLine;
        }

        public static void writeToFile(BufferedWriter writer, String line) {
            try {
                writer.write(line + "\r\n");
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        public void run() {
            try {
                BufferedWriter writer = Files.newBufferedWriter(Paths.get(fileName));
                int totalLines = endLine + 1 - startLine;
                Stream<String> chunks =
                        Files.lines(Paths.get(Mysort20GB.fPath))
                                .skip(startLine - 1)
                                .limit(totalLines)
                                .sorted(Comparator.naturalOrder());

                chunks.forEach(line -> {
                    writeToFile(writer, line);
                });
                System.out.println(" Done Writing " + Thread.currentThread().getName());
                writer.close();
            } catch (Exception e) {
                System.out.println(e);
            }
        }
    }

    class MergeFiles extends Thread {
        String file1, file2, file3;
        MergeFiles(String file1, String file2, String file3) {
            this.file1 = file1;
            this.file2 = file2;
            this.file3 = file3;
        }

        public void run() {
            try {
                System.out.println(file1 + " Started Merging " + file2 );
                FileReader fileReader1 = new FileReader(file1);
                FileReader fileReader2 = new FileReader(file2);
                FileWriter writer = new FileWriter(file3);
                BufferedReader bufferedReader1 = new BufferedReader(fileReader1);
                BufferedReader bufferedReader2 = new BufferedReader(fileReader2);
                String line1 = bufferedReader1.readLine();
                String line2 = bufferedReader2.readLine();
                //Merge 2 files based on which string is greater.
                while (line1 != null || line2 != null) {
                    if (line1 == null || (line2 != null && line1.compareTo(line2) > 0)) {
                        writer.write(line2 + "\r\n");
                        line2 = bufferedReader2.readLine();
                    } else {
                        writer.write(line1 + "\r\n");
                        line1 = bufferedReader1.readLine();
                    }
                }
                System.out.println(file1 + " Done Merging " + file2 );
                new File(file1).delete();
                new File(file2).delete();
                writer.close();
            } catch (Exception e) {
                System.out.println(e);
            }
        }
    }

    public class Mysort20GB {
        //public static final String fdir = "/Users/diesel/Desktop/";
        public static final String fdir = "/tmp/";
        public static final String shared = "/exports/home/schatterjee/cs553-pa2a/";
        public static final String fPath = "/input/data-20GB.in";
        public static final String opLog = shared+"Mysort20GB.log";

        public static void main(String[] args) throws Exception{
            long startTime = System.nanoTime();
            int threadCount = 16; // Number of threads
            int totalLines = 200000000;
            int linesPerFile = totalLines / threadCount;
            ArrayList<Thread> activeThreads = new ArrayList<Thread>();

            for (int i = 1; i <= threadCount; i++) {
                int startLine = i == 1 ? i : (i - 1) * linesPerFile + 1;
                int endLine = i * linesPerFile;
                SplitFile mapThreads = new SplitFile(fdir + "op" + i, startLine, endLine);
                activeThreads.add(mapThreads);
                mapThreads.start();
            }
            activeThreads.stream().forEach(t -> {
                try {
                    t.join();
                } catch (Exception e) {
                }
            });

            int treeHeight = (int) (Math.log(threadCount) / Math.log(2));

            for (int i = 0; i < treeHeight; i++) {
                ArrayList<Thread> actvThreads = new ArrayList<Thread>();

                for (int j = 1, itr = 1; j <= threadCount / (1 << i); j += 2, itr++) { // threadCount / 2^i files remain at merge level i
                    int offset = i * 100;
                    String tempFile1 = fdir + "op" + (j + offset);
                    String tempFile2 = fdir + "op" + ((j + 1) + offset);
                    String opFile = fdir + "op" + (itr + ((i + 1) * 100));

                    MergeFiles reduceThreads =
                            new MergeFiles(tempFile1,tempFile2,opFile);
                    actvThreads.add(reduceThreads);
                    reduceThreads.start();
                }
                actvThreads.stream().forEach(t -> {
                    try {
                        t.join();
                    } catch (Exception e) {
                    }
                });
            }
            long endTime = System.nanoTime();
            double timeTaken = (endTime - startTime)/1e9;
            System.out.println(timeTaken);
            BufferedWriter logFile = new BufferedWriter(new FileWriter(opLog, true));
            logFile.write("Time Taken in seconds:" + timeTaken);
            Runtime.getRuntime().exec(new String[] {"bash", "-c",
                    "valsort " + fdir + "op" + (treeHeight * 100 + 1) + " > " + opLog}); // a shell is needed for the > redirection
            logFile.close();
        }
    }

心头的小情儿 2024-12-19 00:26:58


Use the Java library big-sorter, which can sort very large text or binary files.

Here's how your exact problem would be implemented:

// imports assumed for this snippet; Sorter comes from the big-sorter
// library (com.github.davidmoten:big-sorter on Maven Central)
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.StandardOpenOption;
import com.github.davidmoten.bigsorter.Sorter;

// write the input to a file
String s = "0052304 0000004000000000000000000000000000000041   John Teddy   000023\n"
        + "0022024 0000004000000000000000000000000000000041   George Clan 00013";
File input = new File("target/input");
Files.write(input.toPath(), s.getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.CREATE, StandardOpenOption.WRITE); // CREATE, so the file need not already exist

File output = new File("target/output");


//sort the input
Sorter
    .serializerLinesUtf8()
    .comparator((a,b) -> {
        String ida = a.substring(0, a.indexOf(' '));
        String idb = b.substring(0, b.indexOf(' '));
        return ida.compareTo(idb);
    }) 
    .input(input) 
    .output(output) 
    .sort();

// display the output
Files.readAllLines(output.toPath()).forEach(System.out::println);

Output:

0022024 0000004000000000000000000000000000000041   George Clan 00013
0052304 0000004000000000000000000000000000000041   John Teddy   000023

峩卟喜欢 2024-12-19 00:26:58


You can use a SQLite file database: load the data into the database, then let it sort and return the results for you.

Advantages: no need to worry about writing the best sorting algorithm.

Disadvantages: you will need disk space, and processing is slower.

https://sites.google.com/site/arjunwebworld/Home/programming/sorting-large-data-files
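
A minimal sketch of that idea using a SQLite JDBC driver (e.g. org.xerial:sqlite-jdbc on the classpath); the file, table, and column names here are made up for illustration:

    import java.io.*;
    import java.sql.*;

    // Sketch: stream lines into an on-disk SQLite table, then read them back with ORDER BY.
    public class SqliteSort {
        public static void main(String[] args) throws Exception {
            try (Connection c = DriverManager.getConnection("jdbc:sqlite:sort.db")) {
                c.createStatement().execute("CREATE TABLE lines (id TEXT, line TEXT)");
                c.setAutoCommit(false); // one big transaction keeps inserts fast
                try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
                     PreparedStatement ins = c.prepareStatement("INSERT INTO lines VALUES (?, ?)")) {
                    String line;
                    int n = 0;
                    while ((line = in.readLine()) != null) {
                        ins.setString(1, line.substring(0, line.indexOf(' ')));
                        ins.setString(2, line);
                        ins.addBatch();
                        if (++n % 10_000 == 0) ins.executeBatch(); // flush periodically
                    }
                    ins.executeBatch();
                }
                c.commit();
                try (PrintWriter out = new PrintWriter(new FileWriter("output.txt"));
                     ResultSet rs = c.createStatement().executeQuery("SELECT line FROM lines ORDER BY id")) {
                    while (rs.next()) out.println(rs.getString(1));
                }
            }
        }
    }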

病毒体 2024-12-19 00:26:58


What you need to do is chunk the file in via a stream and process the chunks separately. Then you can merge the files together, as they will already be sorted; this is similar to how merge sort works.

The answer from this SO question will be of value: Stream large files

末蓝 2024-12-19 00:26:58


Operating systems come with a powerful file-sorting utility. A simple function that calls a bash script should help.

import java.io.File;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

public static void runScript(final Logger log, final String scriptFile) throws IOException, InterruptedException {
    final File script = new File(scriptFile);
    if (!script.exists() || !script.canRead() || !script.canExecute()) {
        log.log(Level.SEVERE, "Cannot find or read " + scriptFile);
        log.log(Level.WARNING, "Make sure the file is executable and you have permissions to execute it. Hint: use \"chmod +x filename\" to make it executable");
        throw new IOException("Cannot find or read " + scriptFile);
    }
    // run the script through bash and propagate a non-zero exit code as an exception
    final int returnCode = Runtime.getRuntime().exec(new String[] {"bash", "-c", scriptFile}).waitFor();
    if (returnCode != 0) {
        log.log(Level.SEVERE, "The script returned an error with exit code: " + returnCode);
        throw new IOException("Script failed with exit code " + returnCode);
    }
}
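
For this question, the script would simply wrap the system sort utility; a hypothetical example (the script path and file names are placeholders):

    // contents of /path/to/sort.sh (illustrative): sort -n -k1,1 /data/input.txt > /data/output.txt
    runScript(Logger.getLogger("sort"), "/path/to/sort.sh");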
百善笑为先 2024-12-19 00:26:58


I used my own logic to sort a big JSON file in this format:

{"name":"hoge.finance","address":"0xfAd45E47083e4607302aa43c65fB3106F1cd7607"}

Full source code is available at https://github.com/sitetester/token-sorter, along with a test case. The code is well documented, so it is easy to understand.

It splits the input file into multiple smaller sorted files (configurable) and then compares the data.

Pasting some comments here...

// at this point, we have sorted data sets in respective files
// next, we will take first token from first file and compare it with tokens of all other files
// during comparison, if some token from other file is in sorted order, then we make it default/initial sorted token
// & jump to next file, since all remaining tokens in THAT file are already in sorted form
// at end of comparisons with all files, we remove it from specific file (so it's not compared next time) and put/append in final sorted file
// this process continues, until all entries are matched
// if some file has no entries, then we simply delete it (so it's not compared next time)