Most efficient way to merge 2 text files
So I have large (around 4 GB each) txt files in pairs, and I need to create a third file consisting of the two files in shuffle mode. The following equation presents it best:

3rdfile = (4 lines from file 1) + (4 lines from file 2)

This is repeated until I hit the end of file 1 (both input files will have the same length - this is by definition). Here is the code I'm using now, but it doesn't scale very well on large files. I was wondering if there is a more efficient way to do this - would working with memory-mapped files help? All ideas are welcome.
public static void mergeFastq(String forwardFile, String reverseFile, String outputFile) {
    try {
        BufferedReader inputReaderForward = new BufferedReader(new FileReader(forwardFile));
        BufferedReader inputReaderReverse = new BufferedReader(new FileReader(reverseFile));
        PrintWriter outputWriter = new PrintWriter(new FileWriter(outputFile, true));
        String forwardLine = null;
        System.out.println("Begin merging Fastq files");
        int readsMerge = 0;
        while ((forwardLine = inputReaderForward.readLine()) != null) {
            // append one 4-line record from the forward file
            outputWriter.println(forwardLine);
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            // append one 4-line record from the reverse file
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            readsMerge++;
            if (readsMerge % 10000 == 0) {
                System.out.println("[" + now() + "] Merged 10000");
                readsMerge = 0;
            }
        }
        inputReaderForward.close();
        inputReaderReverse.close();
        outputWriter.close();
    } catch (IOException ex) {
        Logger.getLogger(Utilities.class.getName()).log(Level.SEVERE, "Error while merging FastQ files", ex);
    }
}
4 Answers
You might also want to try using a BufferedWriter to cut down on your file I/O operations.
http://download.oracle.com/javase/6/docs/api/java/io/BufferedWriter.html
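For instance, wrapping the question's FileWriter in a BufferedWriter is a one-line change (a sketch reusing the variable names from the question; BufferedWriter's default buffer size applies here):

PrintWriter outputWriter = new PrintWriter(
        new BufferedWriter(new FileWriter(outputFile, true)));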
A simple answer is to use a bigger buffer, which helps to reduce the total number of I/O calls being made.
Usually, memory-mapped I/O with a FileChannel (see Java NIO) would be used for handling large data file I/O. In this case, however, it is not a good fit, as you need to inspect the file content to determine the boundary of every 4 lines.
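Both BufferedReader and BufferedWriter take an explicit buffer size, so applying this suggestion to the question's code could look like the following (a sketch; the 8 MB figure is an assumed value, not one given in the answer):

int bufSize = 8 * 1024 * 1024; // 8 MB per stream; an assumed size, tune to taste
BufferedReader inputReaderForward =
        new BufferedReader(new FileReader(forwardFile), bufSize);
BufferedReader inputReaderReverse =
        new BufferedReader(new FileReader(reverseFile), bufSize);
PrintWriter outputWriter = new PrintWriter(
        new BufferedWriter(new FileWriter(outputFile, true), bufSize));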
If performance were the main requirement, I would code this function in C or C++ instead of Java.
But regardless of the language used, what I would do is try to manage memory myself. I would create two large buffers, say 128 MB or more each, and fill them with data from the two text files. Then you need a third buffer that is twice as big as the previous two. The algorithm starts moving characters one by one from input buffer #1 to the destination buffer, counting EOLs as it goes. Once you reach the 4th line, you store the current position in that buffer and repeat the same process with the 2nd input buffer. You keep alternating between the two input buffers, replenishing each buffer once all of its data has been consumed. Each time you have to refill the input buffers, you can also write out the destination buffer and empty it.
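A minimal Java sketch of this alternating scheme, leaning on large BufferedInputStream/BufferedOutputStream buffers to handle the replenishing instead of managing raw arrays by hand (the 128 MB size, the merge/copyFourLines names, and the switch from characters to bytes are all assumptions; working on raw bytes is safe here because the lines are only copied, never interpreted):

import java.io.*;

public class ShuffleMerge {

    private static final int BUF = 128 * 1024 * 1024; // 128 MB per input, as the answer suggests

    public static void merge(String forwardFile, String reverseFile, String outputFile) throws IOException {
        // needs a heap large enough for the buffers, e.g. -Xmx1g
        InputStream fwd = new BufferedInputStream(new FileInputStream(forwardFile), BUF);
        InputStream rev = new BufferedInputStream(new FileInputStream(reverseFile), BUF);
        OutputStream out = new BufferedOutputStream(new FileOutputStream(outputFile), 2 * BUF);
        try {
            // alternate 4-line blocks until file 1 runs out, as the question specifies
            while (copyFourLines(fwd, out)) {
                copyFourLines(rev, out);
            }
        } finally {
            fwd.close();
            rev.close();
            out.close();
        }
    }

    // Copies bytes up to and including the 4th '\n'; returns false if EOF is hit first.
    private static boolean copyFourLines(InputStream in, OutputStream out) throws IOException {
        int newlines = 0;
        int b;
        while (newlines < 4 && (b = in.read()) != -1) {
            out.write(b);
            if (b == '\n') newlines++;
        }
        return newlines == 4;
    }
}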
Buffer your read and write operations. The buffer needs to be large enough to minimize the read/write operations while still being memory efficient. This is really simple and it works.
EDIT:
I just realized that you need to shuffle the lines, so this code will not work for you as is, but the concept still remains the same.
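The answer's original snippet is not reproduced here, but the buffered-copy concept it describes amounts to a chunked read/write loop along these lines (a sketch; reader and writer stand for already-opened Reader/Writer instances, the 1 MB chunk size is an assumed value, and as the EDIT notes, the 4-line interleaving logic would still have to be layered on top):

char[] buffer = new char[1024 * 1024]; // 1 MB chunk; an assumed size
int charsRead;
while ((charsRead = reader.read(buffer)) != -1) {
    writer.write(buffer, 0, charsRead);
}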