Why does the TeraSort map phase spend so much time in the CRC32.update() function?
I am trying to profile which functions consume the most time in a TeraSort Hadoop job. For my test system, I am using a basic 1-node pseudo-distributed setup, which means the NameNode, DataNode, TaskTracker, and JobTracker JVMs all run on the same machine.
I first generate ~9GB of data using TeraGen and then run TeraSort on it. While the JVMs execute, I sample their execution using VisualVM. I know this is not the most accurate profiler out there, but it's free and easy to use! I am using the latest version of the Apache Hadoop distribution, and my experiments run on an Intel Atom-based system.
When I look at the Self time (CPU) for Hot Spots - Methods in VisualVM, I see the java.util.zip.CRC32.update() function taking up nearly 40% of the total time. When I look at this function in the call tree, it is invoked by the mapper's main() function, specifically while IdentityMapper.map() is reading input files from HDFS. The function that actually calls CRC32.update() is org.apache.hadoop.fs.FSInputChecker.readChecksumChunk().
I have three questions regarding this:
Why is a CRC32 checksum being updated for blocks read from HDFS? If I understand correctly, once a block has been read, a simple comparison of the data read from disk with the block's stored CRC should be the only operation, rather than generating and updating the block's CRC value.
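To make my mental model concrete, here is a minimal, self-contained sketch (plain java.util.zip.CRC32, not Hadoop's actual code; the chunk size is only my assumption, based on the classic io.bytes.per.checksum default) of what I understand FSInputChecker to be doing. Even "just comparing" the data against the stored checksum requires recomputing the CRC over every chunk read, which would explain the update() calls in the profile:

```java
import java.util.zip.CRC32;

// Toy illustration, not Hadoop code: verifying a checksummed chunk still requires
// recomputing the CRC over the bytes that were just read, because the stored
// checksum can only be compared against a freshly computed value.
public class ChunkVerifySketch {

    // Assumed chunk size; HDFS checksums data in small chunks
    // (historically io.bytes.per.checksum, 512 bytes by default).
    static final int BYTES_PER_CHECKSUM = 512;

    static boolean verifyChunk(byte[] chunk, int len, long storedChecksum) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, len);   // the three-argument update() seen in the profile
        return crc.getValue() == storedChecksum;
    }

    public static void main(String[] args) {
        byte[] chunk = new byte[BYTES_PER_CHECKSUM];
        for (int i = 0; i < chunk.length; i++) {
            chunk[i] = (byte) i;
        }

        // Simulate the checksum that would have been stored at write time.
        CRC32 writer = new CRC32();
        writer.update(chunk, 0, chunk.length);
        long storedChecksum = writer.getValue();

        System.out.println("chunk verifies: " + verifyChunk(chunk, chunk.length, storedChecksum));
    }
}
```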
I looked up the source for the update function; it lives in java.util.zip.CRC32, and the specific method called is the overloaded update() that takes three arguments. Since this function is implemented in Java, is it possible that the multiple layers of abstraction (Hadoop, the JVM, CPU instructions) are reducing the native efficiency of the CRC calculation?
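To check whether the raw speed of the Java CRC32 implementation on my Atom CPU could by itself explain the 40% figure, I put together a small stand-alone benchmark (buffer size and iteration count are arbitrary, and it bypasses all the Hadoop layers):

```java
import java.util.Random;
import java.util.zip.CRC32;

// Stand-alone throughput check for java.util.zip.CRC32.update(byte[], int, int).
// Buffer size and iteration count are arbitrary; this bypasses Hadoop entirely.
public class Crc32Bench {
    public static void main(String[] args) {
        final int bufSize = 64 * 1024;
        final int iterations = 16 * 1024;   // ~1 GB of data in total
        byte[] buf = new byte[bufSize];
        new Random(42).nextBytes(buf);

        CRC32 crc = new CRC32();
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            crc.update(buf, 0, buf.length); // the three-argument overload from the call tree
        }
        long elapsedNs = System.nanoTime() - start;

        double mb = (double) bufSize * iterations / (1024 * 1024);
        double seconds = elapsedNs / 1e9;
        System.out.printf("checksummed %.0f MB in %.2f s (%.1f MB/s), crc=%d%n",
                mb, seconds, mb / seconds, crc.getValue());
    }
}
```

If the measured MB/s turns out to be low compared with the rate at which TeraSort reads data, it would not be surprising for checksumming to show up this prominently in the profile.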
Finally, is there something grossly wrong with my VisualVM instrumentation methodology or my interpretation of the sampling results?
Thanks,
To your first question, I think the answer is that the CRC files have replicas and can become corrupted. For example, if we have a bunch of files/directories with a replication factor of 2, scenarios can arise in which the CRC needs to be recalculated and updated.
If you take a look at the JIRA issues for Hadoop Common, you can find many issues related to CRC corruption.
For the second question, could you tell me which version of Hadoop you are using? The efficiency of the CRC implementation has been a recurring complaint and has been improved again and again.
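One experiment that might help quantify the cost (treat this as a sketch rather than a recipe; the property name and behaviour depend on your Hadoop release) is to switch the checksum algorithm or temporarily disable client-side verification and re-run the measurement:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only; property names and defaults vary between Hadoop releases.
public class ChecksumExperiment {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // In 2.x-era releases, dfs.checksum.type selects the algorithm
        // (e.g. CRC32C, which can be computed with native/hardware support).
        conf.set("dfs.checksum.type", "CRC32C");

        FileSystem fs = FileSystem.get(conf);

        // For a one-off measurement only: skip verification on read entirely.
        // Do not do this in production; it removes corruption detection.
        fs.setVerifyChecksum(false);

        // Hypothetical path to the TeraGen output; adjust to your setup.
        Path input = new Path("/user/hadoop/terasort-input");
        System.out.println("input exists: " + fs.exists(input));

        fs.close();
    }
}
```

If the map time drops sharply with verification disabled, the profile is telling the truth; as far as I know, later releases added native/hardware-assisted CRC support, so upgrading may help, but please check the release notes for your version.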