Sorting a large file in Java

Posted on 2024-08-24 02:16:24

I have a file that consists of a single line:

 1 , 1 2 , 1 3 6 , 4 ,...

In this representation, spaces separate the digits of the integers and the commas.
This string is so huge that I can't read it with RandomAccessFile.readLine() (almost 4 GB would be needed). So I created a buffer that can hold 10 integers. My task is to sort all the integers in the string.

Could you, please, help?

EDIT

@Oscar Reyes

I need to write some sequences of integers to a file and then read them back. Actually, I don't know how to do it; I'm a newbie. So I decided to write the integers as chars, with "," as the delimiter between integers and "\n\r" as the delimiter between sequences. As a result, I created a monster that reads it:

public BinaryRow getFilledBuffer(String filePath, long offset) throws IOException{
    mainFile = new RandomAccessFile(filePath, "r");

    if (mainFile.length() == 0){
        return new BinaryRow();
    }

    StringBuilder str = new StringBuilder();

    mainFile.seek(mainFile.length()-4); //that is "\n" symbol
    char chN = mainFile.readChar();

    mainFile.seek(offset);
    int i = 0;
    char nextChar = mainFile.readChar();
    while (i < 11 && nextChar != chN){
        str.append(nextChar);
        if (nextChar == ','){
            i++;
            if (i == 10){
                break;
            }
        }
        nextChar = mainFile.readChar();
    }

    if (nextChar == chN){
        position = -1;
    }else{
        position = mainFile.getFilePointer();
    }

    BinaryRow br = new BinaryRow();

    StringBuilder temp = new StringBuilder();

    for (int j = 0; j < str.length(); j++){
        if ((str.charAt(j) != ',')){
            temp.append(str.charAt(j));
            if (j == str.length() - 1){
                br.add(Integer.parseInt(temp.toString()));
            }   
        }else{
            br.add(Integer.parseInt(temp.toString()));
            temp.delete(0, temp.length());
        }
    }


    mainFile.close();
    return br;

}

If you could advise how to do this, please do =)


Comments (2)

夏末染殇 2024-08-31 02:16:25

This is essentially the situation that early uses of QuickSort had to deal with: back then there was not enough RAM to sort everything in memory, so the procedure was to store partial results on disk.

So what you can do is:

  1. Pick a pivot.
  2. Read your file sequentially and store the values lower than the pivot in temp_file_1 and the values greater than or equal to the pivot in temp_file_2
  3. Repeat the procedure in temp_file_1 and append the result to result_file
  4. Repeat the procedure for temp_file_2 and append the result to result_file

When a part is small enough to be sorted in memory (in the extreme case, two values that you simply swap), sort it directly instead of splitting it again.

This way you'll be able to sort in chunks, store the partial results in temp files, and end up with a final file containing the sorted result.

EDIT: I told you a quicksort was possible.

It seems like you would need some extra space for the temp files after all.

Here's what I did.

I created a 40 MB file with numbers separated by commas.

I named it input:

input http://img200.imageshack.us/img200/5129/capturadepantalla201003t.png

The input is 40 MB

During the sort, tmp files holding the "lower than" and "greater than" buckets are created, and when the sort is finished the values are sent to a file called (guess what) output

processing http://img200.imageshack.us/img200/1672/capturadepantalla201003y.png

Temp files are created with the partial results

Finally, all the tmp files are deleted and the result is kept in the file "output", with the numbers in correctly sorted order:

output http://img203.imageshack.us/img203/5950/capturadepantalla201003w.png

Finally the file "output" is created; notice that it is 40 MB too

Here's the full program.

import java.io.*;
import java.util.*;

public class FileQuickSort {

    // 16 megabytes in this sample; the more memory your program has, the less disk writing is needed.
    static final int MAX_SIZE = 1024*1024*16;

    public static void main( String [] args ) throws IOException {
        File output = new File("output");
        // Results are appended to the output file, so remove any leftover from a previous run.
        output.delete();
        fileQuickSort( new File("input"), output );
        System.out.println();
    }

    // Sorts the comma-separated integers in inputFile and appends them to outputFile.
    static void fileQuickSort( File inputFile, File outputFile ) throws IOException {
        Scanner scanner = new Scanner( new BufferedInputStream( new FileInputStream( inputFile ), MAX_SIZE));
        scanner.useDelimiter(",");

        if( inputFile.length() > MAX_SIZE && scanner.hasNextInt()) {
            System.out.print("-");

            // Too big for memory: split the values into two buckets around a pivot.
            File lowerFile = File.createTempFile("quicksort-","-lower.tmp",new File("."));
            File greaterFile = File.createTempFile("quicksort-","-greater.tmp", new File("."));
            PrintStream lower   = createPrintStream(lowerFile);
            PrintStream greater = createPrintStream(greaterFile);
            int pivot = scanner.nextInt();
            // Values equal to the pivot are only counted. Keeping them out of the
            // buckets guarantees that each bucket is smaller than the input, so the
            // recursion also terminates when the file contains many duplicates.
            long pivotCount = 1;

            // Read the file and put the values lower than the pivot in one file
            // and the values greater than the pivot in the other.
            while( scanner.hasNextInt() ){
                int current = scanner.nextInt();

                if( current < pivot ){
                    lower.printf("%d,",current);
                } else if( current > pivot ){
                    greater.printf("%d,",current);
                } else {
                    pivotCount++;
                }
            }
            // Close the streams before reading the buckets again.
            scanner.close();
            lower.close();
            greater.close();

            // Sort the lower part, write the pivot values, then sort the greater part.
            fileQuickSort( lowerFile , outputFile );
            lowerFile.delete();
            PrintStream out = createPrintStream( outputFile );
            for( long n = 0; n < pivotCount; n++ ) {
                out.printf("%d,",pivot);
            }
            out.close();
            fileQuickSort( greaterFile , outputFile );
            greaterFile.delete();

            // And you're done.
        } else {

            // Else, the file is small enough to process in RAM.
            System.out.print(".");
            List<Integer> smallFileIntegers = new ArrayList<Integer>();
            // Read it
            while( scanner.hasNextInt() ){
                smallFileIntegers.add( scanner.nextInt() );
            }
            scanner.close();

            // Sort them in memory
            Collections.sort( smallFileIntegers );

            // Append the sorted values to the output file
            PrintStream out = createPrintStream( outputFile );
            for( int i : smallFileIntegers ) {
                out.printf("%d,",i);
            }
            out.close();
            // And you're done
        }
    }

    // Opens the file in append mode, so successive calls keep adding to the same output.
    private static PrintStream createPrintStream( File file ) throws IOException {
        boolean append = true;
        return new PrintStream(  new BufferedOutputStream( new FileOutputStream( file, append )));
    }
}

The program expects the file format to be: number,number,number,number

Your current format is: n u m b e r , n u m b , b e r

To fix that, you just have to read the whole file once and skip the blanks.

Ask a separate question for that.
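
For the blank-skipping step, here is a minimal sketch. It assumes the blanks are ordinary whitespace characters and that the file can be read as text; the class name BlankStripper and the method stripBlanks are made up for illustration and are not part of the program above. It copies the input one character at a time, so memory use stays constant no matter how long the line is.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper: copies a character stream while dropping all whitespace.
class BlankStripper {

    // Reads one character at a time, so the 4 GB line never has to fit in memory.
    static void stripBlanks(Reader in, Writer out) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (!Character.isWhitespace(c)) {
                out.write(c);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter result = new StringWriter();
        stripBlanks(new StringReader(" 1 , 1 2 , 1 3 6 , 4 ,"), result);
        System.out.println(result); // prints 1,12,136,4,
    }
}
```

For a real file you would probably wrap a FileReader in a BufferedReader and a FileWriter in a BufferedWriter, write the cleaned data to a new file, and use that file as the input of the sort.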

对你而言 2024-08-31 02:16:25

Read the file into memory in chunks (100 MB each?), one chunk at a time; sort each chunk and save it to disk.
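
The chunking step could look roughly like the sketch below. It assumes the input is already in the plain number,number,number format; the names ChunkSorter and writeSortedChunks are hypothetical, and the chunk size is given as a number of integers rather than in megabytes to keep the example short.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper: splits a comma-separated stream of ints into sorted chunk files.
class ChunkSorter {

    // Reads up to maxInts values at a time, sorts them in memory and writes each
    // sorted chunk to its own temp file in dir (null means the default temp directory).
    static List<File> writeSortedChunks(Reader input, int maxInts, File dir) throws IOException {
        List<File> chunkFiles = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter(",");
            int[] buffer = new int[maxInts];
            while (scanner.hasNextInt()) {
                // Fill the buffer with the next chunk.
                int n = 0;
                while (n < maxInts && scanner.hasNextInt()) {
                    buffer[n++] = scanner.nextInt();
                }
                // Sort only the part of the buffer that was filled.
                Arrays.sort(buffer, 0, n);
                File chunk = File.createTempFile("chunk-", ".tmp", dir);
                try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(chunk)))) {
                    for (int i = 0; i < n; i++) {
                        out.print(buffer[i]);
                        out.print(',');
                    }
                }
                chunkFiles.add(chunk);
            }
        }
        return chunkFiles;
    }
}
```

Called with a FileReader on the input file, this returns the list of sorted chunk files, which are then the inputs of the merge.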

Then open all the ordered chunks, read the first element of each, and append the lowest to the output. Then read the next element of the chunk you just read from and repeat.

When merging you can keep an array of the last int read from each chunk and just iterate over it to get the lowest. Then you substitute the value you just used with the next element in the chunk it was taken from.

example with chunks [1, 5, 16] [2, 9, 14] [3, 8, 10]
array [(1), 2, 3],   lowest 1 --> to output
      [5, (2), 3],   lowest 2 --> to output
      [5, 9, (3)],   lowest 3 --> to output
      [(5), 9, 8],   lowest 5 --> to output
      [16, 9, (8)],  lowest 8 --> to output
      [16, (9), 10], lowest 9 --> to output
...
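
The merge traced above can be sketched as follows. To keep the example self-contained, each sorted chunk is represented by an in-memory Iterator<Integer> and the output is collected in a list; in a real run each iterator would read from one chunk file and the output would be written to disk. The class name ChunkMerge and the method merge are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: merges several already-sorted chunks into one sorted sequence.
class ChunkMerge {

    static List<Integer> merge(List<Iterator<Integer>> chunks) {
        int k = chunks.size();
        // heads[i] is the last value read from chunk i; null means the chunk is exhausted.
        Integer[] heads = new Integer[k];
        for (int i = 0; i < k; i++) {
            heads[i] = chunks.get(i).hasNext() ? chunks.get(i).next() : null;
        }
        List<Integer> output = new ArrayList<>();
        while (true) {
            // Iterate over the array to find the lowest remaining value.
            int lowest = -1;
            for (int i = 0; i < k; i++) {
                if (heads[i] != null && (lowest < 0 || heads[i] < heads[lowest])) {
                    lowest = i;
                }
            }
            if (lowest < 0) {
                break; // every chunk is exhausted
            }
            output.add(heads[lowest]);
            // Replace the value just used with the next element of the same chunk.
            Iterator<Integer> source = chunks.get(lowest);
            heads[lowest] = source.hasNext() ? source.next() : null;
        }
        return output;
    }

    public static void main(String[] args) {
        List<Iterator<Integer>> chunks = new ArrayList<>();
        chunks.add(Arrays.asList(1, 5, 16).iterator());
        chunks.add(Arrays.asList(2, 9, 14).iterator());
        chunks.add(Arrays.asList(3, 8, 10).iterator());
        System.out.println(merge(chunks)); // prints [1, 2, 3, 5, 8, 9, 10, 14, 16]
    }
}
```

The linear scan over the array is fine for a handful of chunks. With many chunks, a java.util.PriorityQueue holding one entry per chunk would find the lowest value faster, at the cost of slightly more code.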