How to compare two consecutive jobs on Hadoop
I want to know whether it is possible to compare two consecutive jobs in Hadoop. If it is not, I would appreciate it if anyone could tell me how to proceed. To be precise, I want to compare the jobs in terms of what exactly each of them did. The reason is to build statistics on how many of the jobs executed on Hadoop were similar in behavior, for example how many times the same sorting function was executed on the same input.
For example, suppose the first job did something like SortList(A) and another job did SortList(A) + Group(result(SortList(A))). I am wondering whether Hadoop stores a mapping somewhere, such as JobID X -> SortList(A).
So far, I have approached this problem by looking for the entry point in Hadoop and trying to understand how a job is created, what information is kept with a job ID, and in what form (as code or as some description), but I have not been able to figure it out.
Comments (2)
Hadoop's Counters might be a good place to start. You can define your own counter names (for example, one counter name per data set you are working on) and increment that counter each time you perform a sort on that data set. Finding out which data set you are working on, however, may be the more difficult task.
Here's a tutorial I found:
http://philippeadjiman.com/blog/2010/01/07/hadoop-tutorial-series-issue-3-counters-in-action/
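To make the counter idea concrete, here is a minimal sketch in plain Java. It does not call Hadoop at all: the `SortCounters` class, its methods, and the `Operation(dataset)` naming scheme are hypothetical stand-ins so the example runs without a cluster. In a real mapper or reducer using the `org.apache.hadoop.mapreduce` API, the increment would instead be something like `context.getCounter(groupName, counterName).increment(1)`.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for Hadoop's counter facility (not a Hadoop class).
class SortCounters {
    // Counter name -> number of times it was incremented.
    private final Map<String, Long> counts = new TreeMap<>();

    // Assumed naming scheme: one counter per operation/data set pair.
    static String counterName(String operation, String dataset) {
        return operation + "(" + dataset + ")";
    }

    // Increment the counter for this operation on this data set.
    void increment(String operation, String dataset) {
        counts.merge(counterName(operation, dataset), 1L, Long::sum);
    }

    // Current value of a counter, or 0 if it was never incremented.
    long get(String operation, String dataset) {
        return counts.getOrDefault(counterName(operation, dataset), 0L);
    }

    public static void main(String[] args) {
        SortCounters counters = new SortCounters();
        counters.increment("SortList", "A"); // first job sorts A
        counters.increment("SortList", "A"); // second job sorts A again
        counters.increment("SortList", "B"); // third job sorts B
        System.out.println("SortList on A ran " + counters.get("SortList", "A") + " times");
    }
}
```

As noted above, the hard part remains deciding what string identifies the data set. The sketch simply assumes the job already knows a stable name for its input.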
No. Hadoop jobs are just programs, and they can have any side effects: they can write to ordinary files, HDFS files, or a database. Nothing in Hadoop records all of their activities. All Hadoop does is manage the scheduling and the flow of data.
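If Hadoop does not keep such a mapping, one way to proceed is to record it yourself when each job is submitted. The sketch below is plain Java with no Hadoop dependency. The `JobRegistry` class, the job IDs, and the step descriptions are all hypothetical, and it assumes each job can describe its own steps as strings, which Hadoop will not do for you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry maintained by the submitting code, not by Hadoop.
class JobRegistry {
    // Job ID -> ordered list of step descriptions, kept in submission order.
    private final Map<String, List<String>> stepsByJob = new LinkedHashMap<>();

    // Record what a job claims to do, e.g. ["SortList(A)"].
    void record(String jobId, List<String> steps) {
        stepsByJob.put(jobId, new ArrayList<>(steps));
    }

    // Number of leading steps the two jobs have in common.
    int sharedPrefix(String jobA, String jobB) {
        List<String> a = stepsByJob.getOrDefault(jobA, Collections.emptyList());
        List<String> b = stepsByJob.getOrDefault(jobB, Collections.emptyList());
        int n = 0;
        while (n < a.size() && n < b.size() && a.get(n).equals(b.get(n))) {
            n++;
        }
        return n;
    }

    // True only if both jobs are recorded and list exactly the same steps.
    boolean sameBehavior(String jobA, String jobB) {
        return stepsByJob.containsKey(jobA)
                && stepsByJob.get(jobA).equals(stepsByJob.get(jobB));
    }

    public static void main(String[] args) {
        JobRegistry registry = new JobRegistry();
        registry.record("job_1", Arrays.asList("SortList(A)"));
        registry.record("job_2", Arrays.asList("SortList(A)", "Group(SortList(A))"));
        System.out.println("same behavior: " + registry.sameBehavior("job_1", "job_2"));
        System.out.println("shared leading steps: " + registry.sharedPrefix("job_1", "job_2"));
    }
}
```

Comparing step lists this way only captures what the jobs declare. As the answer points out, it cannot see side effects such as writes to files or databases.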