Data processing with Pig/Hive instead of direct Java map-reduce code?
(Even more basic than "Difference between Pig and Hive? Why have both?")
I have a data processing pipeline written in several Java map-reduce tasks over Hadoop (my own custom code, derived from Hadoop's Mapper and Reducer). It's a series of basic operations such as join, inverse, sort and group by. My code is involved and not very generic.
What are the pros and cons of continuing this admittedly development-intensive approach vs. migrating everything to Pig/Hive with several UDFs? Which jobs won't I be able to execute? Will I suffer a performance degradation (working with 100s of TB)? Will I lose the ability to tweak and debug my code when maintaining it? Will I be able to pipeline part of the jobs as Java map-reduce and use their input-output with my Pig/Hive jobs?
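For concreteness, here is a minimal sketch of the kind of hand-written stage I mean, assuming a hypothetical tab-separated record layout with the group key in column 0 (the job driver is omitted):

    // Minimal sketch of one hand-written "group by" stage (hypothetical
    // record layout: tab-separated lines, group key in column 0).
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class GroupByStage {

        public static class GroupMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            private final Text outKey = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                // Emit (group key, 1) for every input record.
                outKey.set(line.toString().split("\t")[0]);
                ctx.write(outKey, ONE);
            }
        }

        public static class GroupReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values,
                                  Context ctx)
                    throws IOException, InterruptedException {
                // Sum the counts for one group.
                long sum = 0;
                for (LongWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new LongWritable(sum));
            }
        }
    }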
2 Answers
Reference (Twitter): Typically a Pig script is 5% of the code of native map/reduce, written in about 5% of the time. However, queries typically take between 110-150% of the time that a native map/reduce job would have taken to execute. But of course, if there is a routine that is highly performance-sensitive, they still have the option to hand-code the native map/reduce functions directly.
The above reference also discusses the pros and cons of Pig versus developing applications in MapReduce.
As with any higher-level language or abstraction, Pig/Hive gives up some flexibility and performance in exchange for developer productivity.
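To make the trade-off concrete, here is a minimal sketch of a Pig UDF in Java, the mechanism the question mentions for keeping custom logic while moving the joins/sorts/group-bys into Pig Latin. The class name and the normalization logic are hypothetical stand-ins:

    // Hypothetical Pig UDF: the surrounding join/sort/group-by moves into
    // Pig Latin, and only the custom per-field logic stays in Java.
    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class NormalizeKey extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            // Stand-in normalization: trim and lower-case the key.
            return input.get(0).toString().trim().toLowerCase();
        }
    }

In a Pig script this would be invoked after REGISTER-ing the jar, e.g. B = FOREACH A GENERATE NormalizeKey(key);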
In this paper it is stated that, as of 2009, Pig runs 1.5 times slower than plain MapReduce. Higher-level tools built on top of Hadoop are expected to perform slower than plain MapReduce; however, it is true that to make MapReduce perform optimally you need an advanced user who writes a lot of boilerplate code (e.g. binary comparators).
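As an illustration of that boilerplate, here is a sketch of a raw (binary) comparator that orders keys without deserializing them; Hadoop's IntWritable in fact already ships with such a comparator, so this merely re-implements it:

    // Sketch of a raw comparator: compares serialized IntWritable keys
    // byte-for-byte, avoiding object deserialization during the sort.
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.WritableComparator;

    public class IntRawComparator extends WritableComparator {
        public IntRawComparator() {
            super(IntWritable.class);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1,
                           byte[] b2, int s2, int l2) {
            // An IntWritable serializes to a 4-byte big-endian int, so the
            // values can be read straight out of the serialized buffers.
            return Integer.compare(readInt(b1, s1), readInt(b2, s2));
        }
    }

It would be wired into a job with job.setSortComparatorClass(IntRawComparator.class).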
I find it relevant to mention a new API called Pangool (which I'm a developer of) that aims to replace the plain Hadoop MapReduce API by making a lot of things easier to code and understand (secondary sort, reduce-side joins). Pangool imposes barely any performance overhead (around 5% as of its first benchmark) and retains all the flexibility of the original MapRed API.