将大型 gzip 数据文件上传到 HDFS
我有一个用例,我想在 HDFS 上上传大的 gzip 压缩文本数据文件(~ 60 GB)。
我下面的代码大约需要 2 小时才能以 500 MB 的块上传这些文件。以下是伪代码。我正在检查是否有人可以帮助我减少这个时间:
i) int fileFetchBuffer = 500000000; System.out.println("文件获取缓冲区为:" + fileFetchBuffer); 整数偏移量=0; int 字节读取 = -1;
try {
fileStream = new FileInputStream (file);
if (fileName.endsWith(".gz")) {
stream = new GZIPInputStream(fileStream);
BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
String[] fileN = fileName.split("\\.");
System.out.println("fil 0 : " + fileN[0]);
System.out.println("fil 1 : " + fileN[1]);
//logger.info("First line is: " + streamBuff.readLine());
byte[] buffer = new byte[fileFetchBuffer];
FileSystem fs = FileSystem.get(conf);
int charsLeft = fileFetchBuffer;
while (true) {
charsLeft = fileFetchBuffer;
logger.info("charsLeft outside while: " + charsLeft);
FSDataOutputStream dos = null;
while (charsLeft != 0) {
bytesRead = stream.read(buffer, 0, charsLeft);
if (bytesRead < 0) {
dos.flush();
dos.close();
break;
}
offset = offset + bytesRead;
charsLeft = charsLeft - bytesRead;
logger.info("offset in record: " + offset);
logger.info("charsLeft: " + charsLeft);
logger.info("bytesRead in record: " + bytesRead);
//prettyPrintHex(buffer);
String outFileStr = Utils.getOutputFileName(
stagingDir,
fileN[0],
outFileNum);
if (dos == null) {
Path outFile = new Path(outFileStr);
if (fs.exists(outFile)) {
fs.delete(outFile, false);
}
dos = fs.create(outFile);
}
dos.write(buffer, 0, bytesRead);
}
logger.info("done writing: " + outFileNum);
dos.flush();
dos.close();
if (bytesRead < 0) {
dos.flush();
dos.close();
break;
}
outFileNum++;
} // end of if
} else {
// Assume uncompressed file
stream = fileStream;
}
} catch(FileNotFoundException e) {
logger.error("File not found" + e);
}
I have a use case where I want to upload big gzipped text data files (~ 60 GB) on HDFS.
My code below is taking about 2 hours to upload these files in chunks of 500 MB. Following is the pseudo code. I was chekcing if somebody could help me reduce this time:
i) int fileFetchBuffer = 500000000;
System.out.println("file fetch buffer is: " + fileFetchBuffer);
int offset = 0;
int bytesRead = -1;
try {
fileStream = new FileInputStream (file);
if (fileName.endsWith(".gz")) {
stream = new GZIPInputStream(fileStream);
BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
String[] fileN = fileName.split("\\.");
System.out.println("fil 0 : " + fileN[0]);
System.out.println("fil 1 : " + fileN[1]);
//logger.info("First line is: " + streamBuff.readLine());
byte[] buffer = new byte[fileFetchBuffer];
FileSystem fs = FileSystem.get(conf);
int charsLeft = fileFetchBuffer;
while (true) {
charsLeft = fileFetchBuffer;
logger.info("charsLeft outside while: " + charsLeft);
FSDataOutputStream dos = null;
while (charsLeft != 0) {
bytesRead = stream.read(buffer, 0, charsLeft);
if (bytesRead < 0) {
dos.flush();
dos.close();
break;
}
offset = offset + bytesRead;
charsLeft = charsLeft - bytesRead;
logger.info("offset in record: " + offset);
logger.info("charsLeft: " + charsLeft);
logger.info("bytesRead in record: " + bytesRead);
//prettyPrintHex(buffer);
String outFileStr = Utils.getOutputFileName(
stagingDir,
fileN[0],
outFileNum);
if (dos == null) {
Path outFile = new Path(outFileStr);
if (fs.exists(outFile)) {
fs.delete(outFile, false);
}
dos = fs.create(outFile);
}
dos.write(buffer, 0, bytesRead);
}
logger.info("done writing: " + outFileNum);
dos.flush();
dos.close();
if (bytesRead < 0) {
dos.flush();
dos.close();
break;
}
outFileNum++;
} // end of if
} else {
// Assume uncompressed file
stream = fileStream;
}
} catch(FileNotFoundException e) {
logger.error("File not found" + e);
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
您应该考虑使用 来自 Apache 的超级包 IO< /a>.
它有一种方法
可以极大地减少复制文件所需的时间。
You should consider using the super package IO from Apache.
It has a method
that would tremendously reduce time needed to copy your files.
我尝试使用缓冲输入流,但没有发现真正的区别。
我认为文件通道的实现可能会更加高效。告诉我是否不够快。
问候,
史蒂芬
I tried with buffered input stream and saw no real difference.
I suppose a file channel implementation could be even more efficient. Tell me if it's not fast enough.
Greetings,
Stéphane