How can I map a chunk of text as a whole to one node?
Suppose I have a plain text file with the following data:
DataSetOne
content
content
content
DataSetTwo
content
content
content
content
...and so on...
What I want to do is count how many content lines are in each data set. For example, the result should be
<DataSetOne, 3>, <DataSetTwo, 4>
I am a beginner to Hadoop. I wonder if there is a way to map a chunk of data as a whole to a node; for example, send all of DataSetOne to node 1 and all of DataSetTwo to node 2.
Can anyone give me an idea of how to achieve this?
3 Answers
I think the simplest way is to implement the logic in the mapper: keep track of which dataset you are currently in and emit pairs like this:
(DataSetOne, content)
(DataSetOne, content)
(DataSetOne, content)
(DataSetTwo, content)
(DataSetTwo, content)
Then you count the groups in the reduce stage.
If performance becomes an issue, I would suggest considering a combiner.
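A minimal sketch of such a mapper, assuming lines arrive one at a time via the default TextInputFormat and that a dataset header can be recognised by its "DataSet" prefix (the class and field names here are my own, not from the answer):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DataSetMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Remembers which dataset the mapper is currently reading.
    private final Text currentDataSet = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String text = line.toString().trim();
        if (text.startsWith("DataSet")) {
            currentDataSet.set(text);                       // header line: switch the current dataset
        } else if (!text.isEmpty()) {
            context.write(currentDataSet, new Text(text));  // content line: emit (dataset, content)
        }
    }
}

Note that this only works as long as a dataset's header and its content lines end up in the same mapper; the other answers below discuss what happens when the file is split.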
First of all, your datasets will be split across multiple map tasks if they are in separate files or if they exceed the configured block size. So if you have one dataset of 128 MB and your block size is 64 MB, Hadoop will store this file as two blocks and set up one mapper for each block.
This is like the word-count example in the Hadoop tutorials. Like David says, you'll need to map the data into key/value pairs and then reduce on them.
I would implement that like this:
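A possible reducer and job wiring for that, as a sketch only; the class names are illustrative, and it assumes a mapper like the one sketched in the previous answer that emits (dataset, content) pairs:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DataSetCount {

    public static class CountReducer extends Reducer<Text, Text, Text, IntWritable> {
        private final IntWritable count = new IntWritable();

        @Override
        protected void reduce(Text dataSet, Iterable<Text> contents, Context context)
                throws IOException, InterruptedException {
            int n = 0;
            for (Text ignored : contents) {   // each value is one content line of this dataset
                n++;
            }
            count.set(n);
            context.write(dataSet, count);    // e.g. (DataSetOne, 3)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "dataset content count");
        job.setJarByClass(DataSetCount.class);
        job.setMapperClass(DataSetMapper.class);       // the mapper sketched in the answer above
        job.setReducerClass(CountReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}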
Like David also said, you could use a combiner. Combiners are simple reducers used to save resources between the map and reduce phases. They can be set in the job configuration.
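Concretely, the combiner is wired up in the job driver. Note that with the (dataset, content) pairs sketched above, the counting reducer cannot be reused as the combiner directly because its input and output value types differ; if you instead make the mapper emit an IntWritable 1 per content line, a summing reducer can serve as both (SumReducer here is an illustrative name, not defined above):

// Hypothetical wiring, assuming the mapper emits (dataset, 1) and SumReducer sums the counts.
job.setCombinerClass(SumReducer.class);   // runs locally on each mapper's output before the shuffle
job.setReducerClass(SumReducer.class);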
You can extend the FileInputFormat class and implement the RecordReader interface (or if you're using the newer API, extend the RecordReader abstract class) to define how you split your data. Here is a link that gives you an example of how to implement these classes, using the older API.
http://www.questionhub.com/StackOverflow/4235318
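A rough skeleton of that approach using the newer org.apache.hadoop.mapreduce API, for orientation only; the class names are illustrative, and a real implementation would also have to deal with dataset sections that straddle split boundaries:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class DataSetInputFormat extends FileInputFormat<Text, Text> {
    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new DataSetRecordReader();
    }
}

class DataSetRecordReader extends RecordReader<Text, Text> {
    private final LineRecordReader lineReader = new LineRecordReader();  // delegate the raw line reading
    private final Text key = new Text();    // current dataset name
    private final Text value = new Text();  // current content line

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        lineReader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        while (lineReader.nextKeyValue()) {
            String line = lineReader.getCurrentValue().toString().trim();
            if (line.startsWith("DataSet")) {
                key.set(line);               // header line: remember which dataset we are in
            } else if (!line.isEmpty()) {
                value.set(line);             // content line: hand (dataset, content) to the mapper
                return true;
            }
        }
        return false;                        // no more records in this split
    }

    @Override public Text getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() throws IOException { return lineReader.getProgress(); }
    @Override public void close() throws IOException { lineReader.close(); }
}

With such an input format, the mapper already receives (dataset, content) pairs, so it could simply forward them (or emit a count of 1 per pair) for the reducer to count.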