Fine-tuning Pig for local execution
I'm using Pig Latin for log processing because of its expressiveness, on a problem where the data is not big enough to justify setting up a whole Hadoop cluster. I'm running Pig in local mode, but I don't think it is using all the cores available (16 at the moment): monitoring the CPU shows at most 200% CPU usage.
Is there any tutorial or set of recommendations for fine-tuning Pig for local execution? I'm sure all the mappers could use all the available cores with some easy tweaking. (In my script I have already set the default_parallel parameter to 20.)
Best regards.
Comments (1)
Pig's documentation makes it clear that local operation is intended to run single-threaded, taking different code paths for certain functions that would otherwise use distributed sort. As a result, optimizing Pig's local mode seems like the wrong solution to the problem presented.
Have you considered running a local, "pseudo-distributed" cluster instead of investing in a full cluster setup? You can follow Hadoop's instructions for pseudo-distributed operation, then point Pig at localhost. This would have the desired result, at the expense of a two-step startup and teardown.
You'll want to raise the number of default mappers and reducers to consume all cores available on your machine. Fortunately, this is reasonably well documented (admittedly, in the cluster setup documentation); simply define mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum in your local copy of $HADOOP_HOME/conf/mapred-site.xml.
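For illustration, the two properties named in the answer might be set as in the fragment below. This is only a sketch: the value 16 is an assumption taken from the core count mentioned in the question, not a recommendation from the Hadoop documentation, and the fragment omits any other properties that the pseudo-distributed instructions already placed in the file.

```xml
<?xml version="1.0"?>
<!-- Sketch of $HADOOP_HOME/conf/mapred-site.xml for a single 16-core machine. -->
<!-- Keep any properties the pseudo-distributed setup already added here. -->
<configuration>
  <!-- Maximum number of map tasks the TaskTracker runs at once. -->
  <!-- 16 is an assumed value matching the core count in the question. -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>16</value>
  </property>
  <!-- Maximum number of reduce tasks the TaskTracker runs at once. -->
  <!-- Also an assumed value; lower it if map and reduce phases overlap heavily. -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>16</value>
  </property>
</configuration>
```

These limits are read by the TaskTracker daemon, so it most likely needs a restart before new values take effect. Memory per task is the other constraint to watch, since each slot runs in its own JVM.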