Combining two sets of inputs on Hadoop

Posted on 2024-08-30 04:20:29

I have a rather simple Hadoop question which I'll try to present with an example.

Say you have a list of strings and a large file, and you want each mapper to process a piece of the file and one of the strings, in a grep-like program.

How are you supposed to do that? I am under the impression that the number of mappers is a result of the input splits produced. I could run subsequent jobs, one for each string, but it seems kinda... messy?

Edit: I am not actually trying to build a grep MapReduce version. I used it as an example of having two different inputs to a mapper. Let's just say that I have lists A and B and would like a mapper to work on one element from list A and one element from list B.

So given that the problem has no data dependency that would require chaining jobs, is my only option to somehow share all of list A with all mappers and then feed one element of list B to each mapper?

What I am trying to do is build some type of prefix look-up structure for my data. So I have a giant text and a set of strings. This process has a strong memory bottleneck, therefore I was after one chunk of text / one string per mapper.

Comments (3)

迷路的信 2024-09-06 04:20:30

Mappers should be able to work independently and without side effects. The parallelism can be that a mapper tries to match a line against all patterns. Each input is only processed once!

Otherwise you could multiply each input line by the number of patterns, process each line with a single pattern, and run the reducer afterwards. A ChainMapper is the solution of choice here. But remember: a line will appear twice if it matches two patterns. Is that what you want?

In my opinion you should prefer the first scenario: each mapper processes a line independently and checks it against all known patterns.

Hint: you can distribute the patterns to all mappers with the DistributedCache feature! ;-) Input should be split with the InputLineFormat.
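To make the first scenario concrete, here is a minimal sketch of the per-line work such a mapper would do, with the Hadoop plumbing (the `Mapper` subclass, loading the pattern file from the DistributedCache in `setup()`) left out so it is self-contained. The class and method names are illustrative, not Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

// Core of the "one mapper checks each line against all patterns" approach.
// In a real job, `patterns` would be loaded once in setup() from a file
// shipped via the DistributedCache, and each hit would be emitted as
// context.write(new Text(pattern), new Text(line)).
class GrepLogic {
    // Returns the patterns that occur in the given line.
    static List<String> matchLine(String line, List<String> patterns) {
        List<String> hits = new ArrayList<>();
        for (String p : patterns) {
            if (line.contains(p)) {
                hits.add(p);
            }
        }
        return hits;
    }
}
```

Note that a line matching two patterns produces two hits, which is exactly the duplication the answer warns about in the ChainMapper variant.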

话少情深 2024-09-06 04:20:30

A good friend had a great epiphany: what about chaining two mappers?

In the main, run a job that fires up a mapper (no reducer). The input is the list of strings, and we can arrange things so that each mapper gets one string only.

In turn, the first mapper starts a new job, where the input is the text. It can communicate the string by setting a variable in the context.
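A hedged, plain-Java simulation of that two-stage idea, with a `Map` standing in for the Hadoop job configuration: stage one takes one string per "mapper" and prepares the second job's configuration; stage two greps the text with the single pattern read back from it. In real Hadoop, the hand-off would be `conf.set("pattern", s)` on an `org.apache.hadoop.conf.Configuration` before submitting the second job; the property name "pattern" here is illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simulation of chaining two jobs: the first stage's only output is the
// configuration for the second stage; the second stage does the actual grep.
class TwoStageGrep {
    // Stage 1: each string becomes the configuration of its own second-stage job.
    static Map<String, String> makeJobConf(String pattern) {
        Map<String, String> conf = new HashMap<>();
        conf.put("pattern", pattern);
        return conf;
    }

    // Stage 2: the second job's mapper scans the text lines with the
    // single pattern taken from the configuration.
    static List<String> runSecondStage(Map<String, String> conf, List<String> textLines) {
        String pattern = conf.get("pattern");
        List<String> matches = new ArrayList<>();
        for (String line : textLines) {
            if (line.contains(pattern)) {
                matches.add(line);
            }
        }
        return matches;
    }
}
```

This keeps the "one string per mapper / one chunk of text per mapper" shape the question asked for, at the cost of launching one second-stage job per string.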

伤痕我心 2024-09-06 04:20:30

Regarding your edit:
In general a mapper is not used to process two elements at once. It should only process one element at a time. The job should be designed in such a way that there could be a mapper for each input record and it would still run correctly!

Of course it is fine for the mapper to need some supporting information to process the input. This information can be passed via the job configuration (`Configuration.set()`, for example). A larger set of data should be passed via the distributed cache.

Did you have a look at one of these options? I'm not sure if I fully understood your problem, so please check for yourself whether that would work ;-)

BTW: an appreciative vote for my well-investigated previous answer would be nice ;-)
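Combining this answer's two channels (job configuration for a small string, distributed cache for a larger pattern list) with the one-element-per-mapper setup discussed above, a driver might be configured roughly as follows. This is a configuration sketch only: it assumes the Hadoop 2.x MapReduce API on the classpath, and the property name `grep.pattern`, the cache path, and the commented-out mapper class are illustrative, not from the original answers.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GrepDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Small supporting info: pass it through the job configuration.
        conf.set("grep.pattern", "needle");          // property name is illustrative

        Job job = Job.getInstance(conf, "grep");
        job.setJarByClass(GrepDriver.class);

        // Larger data (e.g. the whole pattern list): ship it via the distributed cache.
        job.addCacheFile(new URI("/patterns.txt"));  // path is illustrative

        // One input line per split, so each mapper gets exactly one list element.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // job.setMapperClass(YourMapper.class);     // your mapper class goes here
        job.setNumReduceTasks(0);                    // map-only job

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```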
