Sorting a large file in Java
I have a file that consists of a single row:
1 , 1 2 , 1 3 6 , 4 ,...
In this representation, spaces separate the integers and commas.
This string is so huge that I can't read it with RandomAccessFile.readLine() (almost 4 GB would be needed). So I created a buffer that can hold 10 integers. My task is to sort all the integers in the string.
Could you please help?
EDIT
@Oscar Reyes
I need to write some sequences of integers to a file and then read them back. Actually, I don't know how to do it; I'm a newbie. So I decided to write the integers as chars, with "," as the delimiter between integers and "\n\r" as the delimiter between sequences. And so I created a monster that reads it:
public BinaryRow getFilledBuffer(String filePath, long offset) throws IOException{
    mainFile = new RandomAccessFile(filePath, "r");
    if (mainFile.length() == 0){
        return new BinaryRow();
    }
    StringBuilder str = new StringBuilder();
    mainFile.seek(mainFile.length()-4); //that is "\n" symbol
    char chN = mainFile.readChar();
    mainFile.seek(offset);
    int i = 0;
    char nextChar = mainFile.readChar();
    while (i < 11 && nextChar != chN){
        str.append(nextChar);
        if (nextChar == ','){
            i++;
            if (i == 10){
                break;
            }
        }
        nextChar = mainFile.readChar();
    }
    if (nextChar == chN){
        position = -1;
    }else{
        position = mainFile.getFilePointer();
    }
    BinaryRow br = new BinaryRow();
    StringBuilder temp = new StringBuilder();
    for (int j = 0; j < str.length(); j++){
        if ((str.charAt(j) != ',')){
            temp.append(str.charAt(j));
            if (j == str.length() - 1){
                br.add(Integer.parseInt(temp.toString()));
            }
        }else{
            br.add(Integer.parseInt(temp.toString()));
            temp.delete(0, temp.length());
        }
    }
    mainFile.close();
    return br;
}
If you could advise how to do it, please do it =)
Comments (2)
This is exactly the origin of QuickSort: back then there was not enough RAM to sort in memory, so the procedure was to store partial results on disk.
So what you can do is:
When the parts are small enough (small enough to be sorted in memory), sort them directly.
This way you'll be able to sort in chunks and store the partial results in temp files, and you'll end up with a final file containing the sorted result.
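A minimal sketch of the idea, not the answerer's program: split a file around a pivot into "lower" and "greater" temp files, recurse, and sort a part in memory once it is small enough. To keep it short it assumes the numbers are stored as binary 4-byte ints rather than comma-separated text, and the names `ExternalQuickSort`, `memoryLimit` and `writeInts` are hypothetical.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.function.IntConsumer;

class ExternalQuickSort {

    /** Emits the ints stored in `file` (binary, 4 bytes each) to `sink` in ascending order. */
    static void sort(Path file, int memoryLimit, IntConsumer sink) {
        try {
            sortFile(file, memoryLimit, sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void sortFile(Path file, int limit, IntConsumer sink) throws IOException {
        long count = Files.size(file) / Integer.BYTES;
        if (count <= limit) {                      // small enough: sort in memory
            int[] a = new int[(int) count];
            try (DataInputStream in = open(file)) {
                for (int i = 0; i < a.length; i++) a[i] = in.readInt();
            }
            Arrays.sort(a);
            for (int v : a) sink.accept(v);
            return;
        }
        Path lower = Files.createTempFile("lower", ".tmp");
        Path greater = Files.createTempFile("greater", ".tmp");
        int pivot = 0;                             // set from the first value below
        long equal = 1;                            // how many values equal the pivot
        try (DataInputStream in = open(file);
             DataOutputStream lo = create(lower);
             DataOutputStream hi = create(greater)) {
            pivot = in.readInt();                  // assumption: first value as pivot
            for (long i = 1; i < count; i++) {
                int v = in.readInt();
                if (v < pivot) lo.writeInt(v);
                else if (v > pivot) hi.writeInt(v);
                else equal++;
            }
        }
        sortFile(lower, limit, sink);              // everything below the pivot
        Files.delete(lower);
        for (long i = 0; i < equal; i++) sink.accept(pivot);
        sortFile(greater, limit, sink);            // everything above the pivot
        Files.delete(greater);
    }

    private static DataInputStream open(Path p) throws IOException {
        return new DataInputStream(new BufferedInputStream(Files.newInputStream(p)));
    }

    private static DataOutputStream create(Path p) throws IOException {
        return new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(p)));
    }

    /** Demo helper (hypothetical): writes the values to a temp file of binary ints. */
    static Path writeInts(int... values) {
        try {
            Path p = Files.createTempFile("input", ".bin");
            try (DataOutputStream out = create(p)) {
                for (int v : values) out.writeInt(v);
            }
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path input = writeInts(5, 3, 9, 1, 7, 3, 2);
        List<Integer> sorted = new ArrayList<>();
        sort(input, 2, sorted::add);               // keep at most 2 ints in memory
        System.out.println(sorted);                // [1, 2, 3, 3, 5, 7, 9]
        input.toFile().delete();
    }
}
```

Taking the first value as the pivot degenerates on input that is already sorted (one temp file per level, deep recursion); sampling a few values for the pivot would likely behave better on real data.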
EDIT: I told you a quicksort was possible.
It seems like you would need some extra space for the temp files after all.
Here's what I did.
I created a 40 MB file with numbers separated by commas and named it "input":
input: http://img200.imageshack.us/img200/5129/capturadepantalla201003t.png
(The input is 40 MB.)
During the sort, tmp files holding the buckets of "greater than" and "lower than" values are created, and when the sort finishes, the values are sent to a file called (guess what) "output":
processing: http://img200.imageshack.us/img200/1672/capturadepantalla201003y.png
(Temp files are created with the partial results.)
Finally, all the tmp files are deleted and the result is kept in the file "output", with the numbers in correctly sorted order:
output: http://img203.imageshack.us/img203/5950/capturadepantalla201003w.png
(The file "output" is created at the end; notice it is 40 MB too.)
Here's the full program.
The format of the files is
number,number,number,number
Your current format is:
n u m b e r , n u m b , b e r
To fix that, you just have to read it all and skip the blanks.
Ask a separate question for that.
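As a rough illustration of "read it all and skip the blanks", here is a sketch that streams characters one at a time, ignores everything that is not a digit or a comma, and emits a number at each comma. It assumes non-negative integers that fit in an int; `SpacedIntReader` and `forEachInt` are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

class SpacedIntReader {

    /** Streams the integers from `in`; blanks between digits are skipped, a comma ends a number. */
    static void forEachInt(Reader in, IntConsumer sink) {
        try {
            int value = 0;
            boolean hasDigits = false;
            int c;
            while ((c = in.read()) != -1) {
                if (c >= '0' && c <= '9') {
                    value = value * 10 + (c - '0');   // accumulate the next digit
                    hasDigits = true;
                } else if (c == ',') {
                    if (hasDigits) sink.accept(value);
                    value = 0;
                    hasDigits = false;
                }
                // any other character (space, line break) is skipped
            }
            if (hasDigits) sink.accept(value);        // last number without a trailing comma
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<Integer> numbers = new ArrayList<>();
        forEachInt(new StringReader("1 , 1 2 , 1 3 6 , 4 ,"), numbers::add);
        System.out.println(numbers);                  // [1, 12, 136, 4]
    }
}
```

For the real file, the StringReader would be replaced by a buffered reader over the file, so that only one number is held in memory at a time.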
Read it into memory in chunks (100 MB each?), one chunk at a time; sort each chunk and save it to disk.
Then open all the ordered chunks, read the first element of each, and append the lowest to the output. Then read the next element of the chunk you just read from and repeat.
When merging you can keep an array of the last int read from each chunk and just iterate over it to get the lowest. Then you substitute the value you just used with the next element in the chunk it was taken from.
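The steps above could be sketched roughly as follows. This is an illustration under assumptions, not a tuned implementation: the input is comma-separated non-negative ints, chunks are written as binary temp files, and `ExternalMergeSort` and `chunkSize` (the number of ints held in memory at once) are hypothetical names.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

class ExternalMergeSort {

    /** Sorts comma-separated ints from `in` into `out`, holding at most chunkSize ints in memory. */
    static void sort(Reader in, Writer out, int chunkSize) {
        try {
            List<Path> chunks = splitIntoSortedChunks(in, chunkSize);
            merge(chunks, out);
            for (Path p : chunks) Files.delete(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static List<Path> splitIntoSortedChunks(Reader in, int chunkSize) throws IOException {
        List<Path> chunks = new ArrayList<>();
        int[] buf = new int[chunkSize];
        int n = 0, value = 0, c;
        boolean hasDigits = false;
        do {
            c = in.read();
            if (c >= '0' && c <= '9') {
                value = value * 10 + (c - '0');
                hasDigits = true;
            } else if ((c == ',' || c == -1) && hasDigits) {
                buf[n++] = value;                     // a number just ended
                value = 0;
                hasDigits = false;
            }
            if (n == chunkSize || (c == -1 && n > 0)) {
                chunks.add(writeSortedChunk(buf, n)); // chunk full, or input exhausted
                n = 0;
            }
        } while (c != -1);
        return chunks;
    }

    private static Path writeSortedChunk(int[] buf, int n) throws IOException {
        Arrays.sort(buf, 0, n);
        Path p = Files.createTempFile("chunk", ".bin");
        try (DataOutputStream o = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(p)))) {
            for (int i = 0; i < n; i++) o.writeInt(buf[i]);
        }
        return p;
    }

    private static void merge(List<Path> chunks, Writer out) throws IOException {
        int k = chunks.size();
        DataInputStream[] ins = new DataInputStream[k];
        int[] head = new int[k];                      // last int read from each chunk
        boolean[] live = new boolean[k];              // false once a chunk is exhausted
        try {
            for (int i = 0; i < k; i++) {
                ins[i] = new DataInputStream(
                        new BufferedInputStream(Files.newInputStream(chunks.get(i))));
                head[i] = ins[i].readInt();           // every chunk holds at least one int
                live[i] = true;
            }
            boolean first = true;
            while (true) {
                int min = -1;                         // index of the chunk with the lowest head
                for (int i = 0; i < k; i++)
                    if (live[i] && (min == -1 || head[i] < head[min])) min = i;
                if (min == -1) break;                 // all chunks exhausted
                if (!first) out.write(',');
                out.write(Integer.toString(head[min]));
                first = false;
                try {
                    head[min] = ins[min].readInt();   // refill from the same chunk
                } catch (EOFException endOfChunk) {
                    live[min] = false;
                }
            }
            out.flush();
        } finally {
            for (DataInputStream s : ins) if (s != null) s.close();
        }
    }

    public static void main(String[] args) {
        StringWriter out = new StringWriter();
        sort(new StringReader("5,3,9,1,7,2"), out, 2);
        System.out.println(out);                      // 1,2,3,5,7,9
    }
}
```

The linear scan over `head` follows the answer's description; with many chunks, a java.util.PriorityQueue keyed on the head values would find the lowest element faster.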