How to pass objects to Mappers and Reducers
I have an application that runs on Hadoop. How can I pass objects to the mappers and reducers so they can use them while processing the data? For example, I declare a FieldFilter object to filter the rows processed in the mappers. The filter contains many filter rules, which are specified by users. So I am wondering: how can I pass the filter and its rules to the mappers and reducers?
My idea is to serialize the object into a String, pass the string around via the configuration, and then reconstruct the object from the string. But that does not seem good to me! Are there any other approaches?
Thanks!
public class FieldFilter {
    private final ArrayList<FieldFilterRule> rules = new ArrayList<FieldFilterRule>();

    public FieldFilter addRule(FieldFilterRule... rules) {
        for (int i = 0; i < rules.length; i++) {
            this.rules.add(rules[i]);
            rules[i].setFieldFilter(this);
        }
        return this;
    }
}
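For what it's worth, the serialize-to-String idea can be sketched with standard Java serialization plus Base64; only the `conf.set(...)`/`conf.get(...)` calls are Hadoop-specific, and they are shown in comments. This is a minimal sketch: `FieldFilterRule` here is a simplified stand-in (one field, no `setFieldFilter` back-reference), and both classes are assumed to implement `Serializable`.

```java
import java.io.*;
import java.util.ArrayList;
import java.util.Base64;

// Simplified stand-ins for the classes in the question; the real ones
// would need to implement Serializable for this approach to work.
class FieldFilterRule implements Serializable {
    final String field;
    FieldFilterRule(String field) { this.field = field; }
}

class FieldFilter implements Serializable {
    final ArrayList<FieldFilterRule> rules = new ArrayList<FieldFilterRule>();
    FieldFilter addRule(FieldFilterRule... rs) {
        for (FieldFilterRule r : rs) rules.add(r);
        return this;
    }
}

public class FilterCodec {
    // Driver side: serialize the filter to a Base64 string.
    // In the job driver you would then call conf.set("my.field.filter", encoded).
    static String encode(FieldFilter f) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(f);
        }
        return Base64.getEncoder().encodeToString(bos.toByteArray());
    }

    // Task side: reconstruct the filter from the string,
    // e.g. from conf.get("my.field.filter") inside Mapper.setup().
    static FieldFilter decode(String s) throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(s);
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (FieldFilter) ois.readObject();
        }
    }
}
```

The string survives the round trip through the job configuration unchanged, so the reconstructed filter is equivalent to the one the driver built.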
Answers (2)
You want to use setClass() on the Configuration, as you can see here. You can then instantiate your class via newInstance(). Remember to do the instantiation in the setup() method of the mapper/reducer, so that you don't instantiate the filter every time the map/reduce method is invoked. Good luck.
Edit: I should add that you have access to the Configuration through the context, and that is how you will get the class you need. There is a getClass() method in the Configuration API.
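To make the mechanism concrete without requiring a Hadoop classpath, here is a pure-Java sketch of what setClass()/getClass() do under the hood: the class *name* is stored as a configuration string, and the class is re-resolved and instantiated on the task side. In real Hadoop code the equivalents would be `conf.setClass("my.filter.class", NonEmptyFilter.class, Filter.class)` at job setup and `ReflectionUtils.newInstance(conf.getClass(...), conf)` inside Mapper.setup(); the `Filter` interface and key name below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical filter interface, standing in for the user's FieldFilter contract.
interface Filter {
    boolean accept(String row);
}

// A concrete filter that must have a no-arg constructor so it can be
// instantiated reflectively, just like classes passed via setClass().
class NonEmptyFilter implements Filter {
    public boolean accept(String row) { return !row.isEmpty(); }
}

public class ClassConfigSketch {
    // Stand-in for org.apache.hadoop.conf.Configuration's string key/value store.
    static final Map<String, String> conf = new HashMap<>();

    // What Configuration.setClass() does: store the fully qualified class name.
    static void setClass(String key, Class<? extends Filter> cls) {
        conf.put(key, cls.getName());
    }

    // What conf.getClass() + newInstance() do on the task side:
    // resolve the name back to a Class and construct a fresh instance.
    static Filter newFilterInstance(String key) throws Exception {
        Class<?> cls = Class.forName(conf.get(key));
        return (Filter) cls.getDeclaredConstructor().newInstance();
    }
}
```

Because only the class name crosses the wire, any rule *data* (as opposed to rule *code*) still has to travel separately, e.g. via ordinary conf.set() string properties read in setup().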
Serialize FieldFilter, put it in HDFS, and later read it in the mapper/reducer functions using the HDFS API. If you have a large cluster, you might want to increase the replication factor (which defaults to 3) for the serialized FieldFilter file, since a large number of mapper and reducer tasks will be reading it.
If the new MapReduce API is used, the serialized FieldFilter file can be read in the Mapper.setup() function, which is called during initialization of the map task. I could not find something similar for the old MapReduce API.
You can also consider using DistributedCache to distribute the serialized FieldFilter file to the different nodes.
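The write-to-shared-storage pattern above can be sketched with the local filesystem standing in for HDFS, so it runs anywhere. On a real cluster you would write with `FileSystem.create(path)` at job-submission time and read with `FileSystem.open(path)` (or from a file shipped via DistributedCache) inside Mapper.setup(); the single-rule `SerializableFilter` below is a hypothetical simplification of the question's FieldFilter.

```java
import java.io.*;
import java.nio.file.*;

// Hypothetical single-rule filter standing in for FieldFilter;
// it must be Serializable for this approach to work.
class SerializableFilter implements Serializable {
    final String fieldName;
    SerializableFilter(String fieldName) { this.fieldName = fieldName; }
}

public class FilterDistribution {
    // Driver side: serialize the filter to a path all tasks can reach
    // (on a cluster this would be an HDFS path, written via FileSystem.create()).
    static void writeFilter(Path path, SerializableFilter f) throws IOException {
        try (ObjectOutputStream oos = new ObjectOutputStream(Files.newOutputStream(path))) {
            oos.writeObject(f);
        }
    }

    // Task side: deserialize once per task; in Hadoop this belongs in
    // Mapper.setup() so the read happens once, not per record.
    static SerializableFilter readFilter(Path path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(Files.newInputStream(path))) {
            return (SerializableFilter) ois.readObject();
        }
    }
}
```

DistributedCache changes only where the file comes from on the task side (a local cached copy instead of an HDFS read); the serialize-once, deserialize-in-setup() shape stays the same.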