Hadoop: set a variable such as a HashSet only once so it can be used many times in each map task
Hi, I have a HashSet which needs to be used in each and every map task in Hadoop. I don't want to initialize it multiple times. I heard that this is possible by setting the variable in the configure function. Any suggestions are welcome.
4 Answers
It seems that you haven't really understood the execution strategy of Hadoop.

If you are in distributed mode, you cannot share a collection (a HashSet) across multiple map tasks. That is because tasks are executed in their own JVMs, and even with JVM reuse it is not deterministic that your collection will still be there after the JVM has been reset.

What you can do is set up a HashSet once per task at the beginning of the computation. To do that, override the setup(Context ctx) method, which is called before the map method. However, you need enough RAM to hold the HashSet in each task. If you don't have that capacity, you should consider a distributed-cache solution, but that has overhead because each query must be serialized and deserialized. And it is not guaranteed that the data is locally available, so it may take a lot longer than a collection kept within the task.
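A minimal sketch of the per-task setup idea. This is a plain-Java stand-in for Hadoop's Mapper lifecycle (the framework calls setup() once per task, then map() once per record); the class name, the stop-word field, and the simplified method signatures are illustrative, not the real org.apache.hadoop API:

```java
import java.util.HashSet;
import java.util.Set;

// Plain-Java stand-in for Hadoop's Mapper lifecycle: the framework calls
// setup() once per task attempt, then map() once per input record.
public class SetupOnceDemo {
    static class WordFilterMapper {
        private final Set<String> stopWords = new HashSet<>();
        private int setupCalls = 0;

        // In a real Hadoop Mapper this would be: protected void setup(Context ctx)
        void setup() {
            setupCalls++;
            // Expensive initialization happens exactly once per task,
            // not once per record.
            stopWords.add("the");
            stopWords.add("a");
        }

        // In a real Hadoop Mapper this would be:
        // protected void map(K key, V value, Context ctx)
        boolean map(String word) {
            // Returns true if the word should be kept (emitted).
            return !stopWords.contains(word);
        }

        int setupCalls() { return setupCalls; }
    }

    public static void main(String[] args) {
        WordFilterMapper mapper = new WordFilterMapper();
        mapper.setup(); // invoked once by the framework, before any map() call
        System.out.println(mapper.map("hadoop")); // kept
        System.out.println(mapper.map("the"));    // filtered out
        System.out.println(mapper.setupCalls());  // 1
    }
}
```

The point is simply that the set is built once per task attempt, then reused across all map() calls within that task.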
Map tasks run on multiple nodes, and each node has multiple JVMs in which the map tasks execute. So, as-is, it's not possible to share a HashSet across map tasks. There are a couple of ways to alleviate the problem mentioned in the OP:

Use task JVM reuse.
Use a distributed-cache solution.
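For task JVM reuse, classic (MRv1) Hadoop exposes a job configuration property; a sketch assuming Hadoop 1.x, where this property exists (it was dropped in YARN/MRv2):

```xml
<!-- mapred-site.xml (or set per job): let the task JVM be reused for an
     unlimited number of tasks of the same job (-1 = no limit; the default
     of 1 means a fresh JVM per task) -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```

With reuse enabled, a static field initialized by the first task survives for later tasks of the same job that land in the same JVM, which is what the answers below rely on.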
If you declare the HashSet as static, it will be initialized once per task.
If you need to share it between tasks, you need something shared, such as the distributed cache mentioned by Praveen.
The question is what you want to save. If it is the time to initialize the set, then I would suggest using a static variable which is initialized if null and never cleared. As a result, each task that happens to reuse the same JVM will use it.
If the data is relatively small, you can serialize it as a string and pass it via the configuration.
If the data is big, you can use the distributed cache to deliver the data to each node, and then read it once per JVM.
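The lazy static-initialization idea can be sketched in plain Java (the class, field, and element names are illustrative; in a real job this static field would live in your Mapper class and be filled from the configuration or the distributed cache rather than hard-coded):

```java
import java.util.HashSet;
import java.util.Set;

public class StaticSetDemo {
    // Shared by every task attempt that happens to run in this JVM.
    private static Set<String> lookup;
    static int initCount = 0; // only here to show the set is built once

    // Initialize once per JVM; later tasks reusing the JVM skip the work.
    // synchronized guards against concurrent first access.
    static synchronized Set<String> getLookup() {
        if (lookup == null) {
            initCount++;
            lookup = new HashSet<>();
            // Stand-in for the expensive load (file read, parsing, ...).
            lookup.add("expensive");
            lookup.add("to");
            lookup.add("build");
        }
        return lookup;
    }

    public static void main(String[] args) {
        getLookup(); // first task in this JVM: builds the set
        getLookup(); // a later task reusing the JVM: gets the cached set
        System.out.println(initCount); // prints 1
    }
}
```

Note that this only pays off when JVM reuse is enabled; with the default of one JVM per task, the static field is rebuilt for every task anyway.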