HDFS: Appending to a SequenceFile using the HDFS API
I've been trying to create and maintain a SequenceFile on HDFS using the Java API, without running a MapReduce job, as a setup for a future MapReduce job. I want to store all of the input data for the MapReduce job in a single SequenceFile, but data gets appended over the course of the day. The problem is that if the SequenceFile already exists, the following call simply overwrites it instead of appending to it.
// fs and conf are set up for HDFS, not as a LocalFileSystem
SequenceFile.Writer seqWriter = SequenceFile.createWriter(fs, conf, new Path(hdfsPath),
        keyClass, valueClass, SequenceFile.CompressionType.NONE);
seqWriter.append(new Text(key), new BytesWritable(value));
seqWriter.close();
Another concern is that I cannot maintain a file of my own format and turn the data into a SequenceFile at the end of the day as a MapReduce job could be launched using that data at any point.
I cannot find any other API call to append to a SequenceFile while preserving its format. I also cannot simply concatenate two SequenceFiles, because of their internal format (header and sync markers).
I also wanted to avoid running a MapReduce job for this since it has high overhead for the little amount of data I'm adding to the SequenceFile.
Any thoughts or work-arounds? Thanks.
2 Answers
Support for appending to existing SequenceFiles was added in the Apache Hadoop 2.6.1 and 2.7.2 releases onwards, via the enhancement JIRA https://issues.apache.org/jira/browse/HADOOP-7139. For example usage, see the test case: https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/TestSequenceFileAppend.java#L63-L140

CDH5 users can find the same ability in version CDH 5.7.1 onwards.
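Based on that JIRA, a minimal sketch of opening a writer in append mode might look like the following. It uses the `SequenceFile.Writer.appendIfExists(true)` option introduced by HADOOP-7139; the path and key/value payload here are placeholders, and the key class, value class, and compression settings must match those of the existing file or the writer will refuse to append.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileAppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder path; point this at your HDFS file.
        Path path = new Path("/data/input.seq");

        // appendIfExists(true) reopens an existing SequenceFile for appending
        // instead of overwriting it; if the file does not exist, it is created.
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.NONE),
                SequenceFile.Writer.appendIfExists(true));
        try {
            writer.append(new Text("key"), new BytesWritable(new byte[] {1, 2, 3}));
        } finally {
            writer.close();
        }
    }
}
```

Running this twice against the same path should leave both records in the file, which addresses the overwrite problem in the question (on Hadoop 2.6.1/2.7.2 or later only).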
Sorry, currently the Hadoop FileSystem does not support appends. But there are plans for it in a future release.