What's wrong with my Hive UDF? How do I set the number of map tasks in Hive?
I use Hadoop and Hive to analyse Apache logs and compute access statistics. I wrote a UDF named GetCity that converts remote_ip to a city name, but when I run "select GetCity(remote_ip) from log_pre;" it is very slow, and it even fails when the data grows to more than 1000 items.
I tried setting mapred.reduce.tasks=10, but the JobTracker still shows that the total number of map tasks is 1. How can I get more map tasks for a select?
Comments (1)
When you run a query like this, the "GetCity(remote_ip)" call always happens in the mapper. In fact, I doubt anything happens in the reducer here, except perhaps file concatenation. You can control the number of map tasks from Hive by running:
SET mapred.map.tasks=10;
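As a rough sketch of how this could look in a full Hive session: mapred.reduce.tasks only affects reducers, which is why changing it left the map count at 1. Note also that mapred.map.tasks is generally treated as a hint, since the number of mappers follows the number of input splits. The split-size setting and its value below are assumptions for illustration (they presume the table is read through a combining input format such as CombineHiveInputFormat), not something confirmed for this cluster.

```sql
-- Sketch only: values are illustrative, not tuned for any real cluster.
-- mapred.reduce.tasks controls reducers; this map-only query does not use it.

-- Hint for the number of map tasks; the input format decides the real split count.
SET mapred.map.tasks=10;

-- Assumption: input is read via CombineHiveInputFormat, where a smaller maximum
-- split size (in bytes) produces more splits and therefore more mappers.
-- 1048576 bytes (1 MB) is an assumed example value.
SET mapred.max.split.size=1048576;

SELECT GetCity(remote_ip) FROM log_pre;
```

If the input is a single small file, one mapper may still be the natural outcome, so the map count reported by the JobTracker is worth checking again after changing these settings.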
Hope this helps,
synctree