HDFS: Appending to a SequenceFile using the HDFS API

Posted 2024-10-31 15:07:23

I've been trying to create and maintain a Sequence File on HDFS using the Java API without running a MapReduce job as a setup for a future MapReduce job. I want to store all of my input data for the MapReduce job in a single Sequence File, but the data gets appended over time throughout the day. The problem is, if a SequenceFile exists, the following call will just overwrite the SequenceFile instead of appending to it.

// fs and conf are set up for HDFS, not as a LocalFileSystem
// createWriter creates a new file at hdfsPath, overwriting any existing one
SequenceFile.Writer seqWriter = SequenceFile.createWriter(fs, conf, new Path(hdfsPath),
               keyClass, valueClass, SequenceFile.CompressionType.NONE);
seqWriter.append(new Text(key), new BytesWritable(value));
seqWriter.close();

Another concern is that I cannot maintain a file of my own format and turn the data into a SequenceFile at the end of the day as a MapReduce job could be launched using that data at any point.

I cannot find any other API call to append to a SequenceFile and maintain its format. I also cannot simply concatenate two SequenceFiles because of their formatting needs.

I also wanted to avoid running a MapReduce job for this since it has high overhead for the little amount of data I'm adding to the SequenceFile.

Any thoughts or work-arounds? Thanks.

Comments (2)

丢了幸福的猪 2024-11-07 15:07:23

Support for appending to existing SequenceFiles was added in the Apache Hadoop 2.6.1 and 2.7.2 releases, via the enhancement JIRA: https://issues.apache.org/jira/browse/HADOOP-7139

For example usage, see the test case: https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/TestSequenceFileAppend.java#L63-L140

CDH5 users can find the same capability in CDH 5.7.1 and later.
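
As a minimal sketch (untested here; the path, key, and value below are placeholders), the Writer.appendIfExists(true) option introduced by HADOOP-7139 can be passed to SequenceFile.createWriter so that an existing file is appended to instead of overwritten. The key/value classes and compression settings have to match the file being appended to:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Writer;
import org.apache.hadoop.io.Text;

public class SequenceFileAppendSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Placeholder path; in the question this would be the existing hdfsPath
        Path path = new Path("hdfs:///data/input.seq");

        // Writer.appendIfExists(true) (HADOOP-7139, Hadoop 2.6.1+/2.7.2+) appends
        // to the file when it already exists instead of overwriting it. The
        // key/value classes and compression type must match the existing file.
        try (Writer writer = SequenceFile.createWriter(conf,
                Writer.file(path),
                Writer.keyClass(Text.class),
                Writer.valueClass(BytesWritable.class),
                Writer.compression(SequenceFile.CompressionType.NONE),
                Writer.appendIfExists(true))) {
            writer.append(new Text("some-key"), new BytesWritable("some-value".getBytes()));
        }
    }
}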

自演自醉 2024-11-07 15:07:23

Sorry, currently the Hadoop FileSystem does not support appends. But there are plans for it in a future release.
