Is it possible to append to an HDFS file from multiple clients in parallel?
Basically the whole question is in the title. I'm wondering if it's possible to append to a file located on HDFS from multiple computers simultaneously? Something like storing a stream of events constantly produced by multiple processes. Order is not important.
I recall hearing in one of the Google tech presentations that GFS supports such append functionality, but some limited testing with HDFS (either with the regular file append() or with SequenceFile) doesn't seem to work.
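For reference, here is roughly the kind of single-writer append I was testing (a minimal sketch; it assumes appends are enabled on the cluster and that the placeholder path /logs/events.log already exists):

    // Minimal single-writer append sketch; the path and payload are placeholders.
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendAttempt {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/logs/events.log");

            // This works for one writer at a time; a second process calling
            // append() on the same path concurrently cannot obtain the lease.
            try (FSDataOutputStream out = fs.append(file)) {
                out.write("event-payload\n".getBytes(StandardCharsets.UTF_8));
                out.hsync(); // flush the record out to the DataNodes
            }
        }
    }

Running two copies of this against the same file at the same time is exactly the part that doesn't work for me.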
Thanks,
2 Answers
I don't think that this is possible with HDFS. Even though you don't care about the order of the records, you do care about the order of the bytes in the file. You don't want writer A to write a partial record that then gets corrupted by writer B. This is a hard problem for HDFS to solve on its own, so it doesn't.
Create a file per writer. Pass all the files to any MapReduce worker that needs to read this data. This is much simpler and fits the design of HDFS and Hadoop. If non-MapReduce code needs to read this data as one stream, then either stream each file sequentially or write a very quick MapReduce job to consolidate the files.
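For what it's worth, a rough sketch of the file-per-writer idea (the /events directory and the UUID-based naming are just examples):

    // Each writer process creates its own file under a shared directory, so no
    // two writers ever open the same HDFS path. Names here are illustrative only.
    import java.nio.charset.StandardCharsets;
    import java.util.UUID;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class EventWriter implements AutoCloseable {
        private final FSDataOutputStream out;

        public EventWriter(FileSystem fs, String baseDir) throws Exception {
            // One unique file per writer, e.g. /events/writer-<uuid>.log
            Path path = new Path(baseDir, "writer-" + UUID.randomUUID() + ".log");
            this.out = fs.create(path, false); // fail rather than overwrite
        }

        public void write(String event) throws Exception {
            out.write((event + "\n").getBytes(StandardCharsets.UTF_8));
            out.hsync(); // make the record visible to readers
        }

        @Override
        public void close() throws Exception {
            out.close();
        }

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            try (EventWriter writer = new EventWriter(fs, "/events")) {
                writer.write("some-event");
            }
        }
    }

A MapReduce job can then take the whole /events directory as its input path, and non-MapReduce readers can list the directory with fs.listStatus() and stream the files one by one.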
Just FYI, it will probably be fully supported in Hadoop 2.6.x, according to the JIRA item on the official site: https://issues.apache.org/jira/browse/HDFS-7203