Why doesn't the Hadoop file system support random I/O?
Distributed file systems like the Google File System and Hadoop don't support random I/O.
(You can't modify a file that was written earlier; only writing and appending are possible.)
Why did they design the file system like this?
What are the important advantages of this design?
P.S. I know Hadoop will support modifying data that has already been written, but they said its performance would be very poor. Why?
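To make the access model concrete, here is a minimal Java sketch against the Hadoop FileSystem client API (the NameNode URI and file path are made up for illustration). The API lets you create a file and append to it, but the write stream offers no way to seek back and overwrite earlier bytes.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendOnlyDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; replace with your cluster's URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/tmp/append-only-demo.txt");

        // Sequential write: the only way to put new data into a file.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("first record\n");
        }

        // Append: adds bytes at the end of the existing file
        // (append must be enabled on the cluster).
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("second record\n");
        }

        // Note there is no "open for update" call: FSDataOutputStream cannot
        // reposition backwards, so overwriting earlier bytes in place is
        // simply not expressible through this API.
        fs.close();
    }
}
```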
Comments (2)
Hadoop distributes and replicates files. Since the files are replicated, any write operation would have to find each replica across the network and update it, which heavily increases the time for the operation. Updating a file could also push it over the block size, requiring the file to be split into two blocks and the second block to be replicated. I don't know the internals of when/how it would split a block, but it's a potential complication. A sketch of how widely a single file is spread out follows below.
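As a rough illustration of that fan-out, the sketch below (the file path is hypothetical) asks the NameNode for a file's replication factor, block size, and the hosts holding each block replica; a hypothetical in-place update would have to reach every one of those hosts.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicaLayout {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/events.log"));

        // Every block of the file is stored this many times across the cluster.
        System.out.println("replication = " + status.getReplication());
        System.out.println("block size  = " + status.getBlockSize());

        // Each block lives on several DataNodes; an in-place update would
        // have to be pushed to every replica of every affected block.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset() + " -> "
                    + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```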
What if a job that has already applied an update fails or gets killed and is re-run? It could end up updating the file multiple times.
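That re-run hazard is one reason the common HDFS pattern is to rewrite rather than update: a task writes its full output to a fresh temporary path and only renames it into place on success, so a re-run overwrites the scratch output instead of applying the same update twice. A hedged sketch of that pattern (the paths are invented):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteThenRename {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        Path tmp = new Path("/output/_tmp/part-00000");
        Path dst = new Path("/output/part-00000");

        // Write the whole result to a scratch location; a failed or killed
        // attempt leaves only this temporary file behind.
        try (FSDataOutputStream out = fs.create(tmp, true)) {
            out.writeBytes("recomputed result\n");
        }

        // Publish by renaming; a re-run repeats the write above and the
        // rename, never an in-place update of the published file.
        fs.delete(dst, false);
        if (!fs.rename(tmp, dst)) {
            throw new IllegalStateException("rename failed: " + tmp + " -> " + dst);
        }
        fs.close();
    }
}
```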
The advantage of not updating files in a distributed system is that when you update a file you don't know who else is using it, and you don't know where its pieces are stored. There are potential timeouts (the node holding a block may be unresponsive), so you might end up with mismatched data. (Again, I don't know the internals of Hadoop; an update with a node down might well be handled, this is just me brainstorming.)
There are a lot of potential issues with updating files on HDFS (a few laid out above). None of them are insurmountable, but checking for and accounting for them would come with a performance hit.
Since HDFS's main purpose is to store data for use in MapReduce, row-level updates aren't that important at this stage.
I think it's because of the block size of the data, and because the whole idea of Hadoop is that you don't move data around; instead you move the algorithm to the data.
Hadoop is designed for non-real-time batch processing of data. If you're looking for something closer to a traditional RDBMS in terms of response time and random access, have a look at HBase, which is built on top of Hadoop.
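For contrast, here is a small sketch of random row-level access through the HBase client API (the table and column names are invented). HBase provides this on top of HDFS without updating HDFS files in place; it buffers writes in memory and flushes them into new, sorted files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomAccessDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {

            // Random write: set a single cell addressed by row key.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("status"), Bytes.toBytes("active"));
            table.put(put);

            // Random read: fetch that row back by key.
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("d"), Bytes.toBytes("status"))));
        }
    }
}
```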