How do I map a group of text as a whole to a single node?

Posted 2024-10-12 05:38:39

Suppose I have a plain text file with the following data:

DataSetOne
content
content
content

DataSetTwo
content
content
content
content

...and so on...

What I want to do is count how many content lines are in each data set. For example, the result should be

<DataSetOne, 3>, <DataSetTwo, 4>

I am a beginner to Hadoop, and I wonder if there is a way to map a chunk of data as a whole to a node: for example, send all of DataSetOne to node 1 and all of DataSetTwo to node 2.

Can anyone give me an idea of how to achieve this?


Comments (3)

霊感 2024-10-19 05:38:39

I think the simple way is to implement the logic in the mapper: keep track of the current dataset and emit pairs like this:

(DataSetOne, content)
(DataSetOne, content)
(DataSetOne, content)

(DataSetTwo, content)
(DataSetTwo, content)

Then you count the groups in the reduce stage.

If performance becomes an issue, I would suggest considering a combiner.
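
A minimal sketch of that approach (not from the original answer), assuming the default TextInputFormat feeds the mapper one line at a time and that every dataset header line starts with the literal prefix "DataSet"; the class names are hypothetical:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DataSetCount {

    // Mapper: remembers the most recent dataset header and emits
    // (datasetName, contentLine) for every content line that follows it.
    public static class DataSetMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text currentDataSet = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            if (line.isEmpty()) {
                return;                       // skip blank lines between datasets
            }
            if (line.startsWith("DataSet")) { // assumption: header lines start with "DataSet"
                currentDataSet.set(line);
            } else {
                context.write(currentDataSet, value);
            }
        }
    }

    // Reducer: counts the content lines emitted for each dataset,
    // producing <DataSetOne, 3> and <DataSetTwo, 4> for the sample input.
    public static class DataSetCountReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (Text ignored : values) {
                count++;
            }
            context.write(key, new IntWritable(count));
        }
    }
}

Note that this only counts correctly if all lines of a dataset reach the same mapper, which is what the other answers address.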

不忘初心 2024-10-19 05:38:39

First of all, your datasets are split across multiple map tasks if they are in separate files or if they exceed the configured block size. So if you have one 128 MB dataset and your block size is 64 MB, Hadoop will split that file into 2 blocks and set up a mapper for each.
This is like the word-count example in the Hadoop tutorials. Like David says, you'll need to map the key/value pairs and then reduce on them.
I would implement that like this:

// Field in the mapper class: remembers the id of the current group
// (this sketch assumes an input format that supplies the group id as an IntWritable key).
private IntWritable groupId = new IntWritable(0);

@Override
protected void map(IntWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    if (key.get() != groupId.get()) {
        groupId.set(key.get());           // a new group starts here
    }
    context.write(groupId, value);        // emit (groupId, content line)
}

@Override
protected void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    int size = 0;
    for (Text v : values) {               // count the records of this group
        size++;
    }
    context.write(key, new IntWritable(size));
}

Like David said as well, you could use a combiner. Combiners are simple reducers and are used to save resources between the map and reduce phases. They can be set on the job configuration.
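
For illustration, here is a minimal self-contained job sketch (not part of the original answer) showing where the combiner is registered. Because a combiner must consume and produce the mapper's output types and may run any number of times, this version has the mapper emit (datasetName, 1) and the reducer sum the counts, so the same class can double as the combiner; all class names and the "DataSet" header prefix are assumptions:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DataSetCountJob {

    // Mapper: emits (datasetName, 1) for every content line after a header.
    public static class SumCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text currentDataSet = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            if (line.isEmpty()) {
                return;
            }
            if (line.startsWith("DataSet")) {   // assumption: header lines start with "DataSet"
                currentDataSet.set(line);
            } else {
                context.write(currentDataSet, ONE);
            }
        }
    }

    // Reducer: sums partial counts; safe to reuse as the combiner because its
    // input and output types match the mapper output and addition is associative.
    public static class SumCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "dataset count");
        job.setJarByClass(DataSetCountJob.class);
        job.setMapperClass(SumCountMapper.class);
        job.setCombinerClass(SumCountReducer.class); // the combiner is just a reducer run map-side
        job.setReducerClass(SumCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}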

逆光飞翔i 2024-10-19 05:38:39

You can extend the FileInputFormat class and implement the RecordReader interface (or if you're using the newer API, extend the RecordReader abstract class) to define how you split your data. Here is a link that gives you an example of how to implement these classes, using the older API.

http://www.questionhub.com/StackOverflow/4235318
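
For reference, a minimal sketch of that idea with the newer (org.apache.hadoop.mapreduce) API, under the assumption that each dataset is stored in its own file: the input format below refuses to split files, and its record reader hands the whole file to a single mapper as one record, so one mapper always sees one complete dataset. The class names are hypothetical, and this follows the common whole-file pattern rather than the linked example:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Each input file becomes exactly one record, so a dataset stored in its
// own file is always processed as a whole by a single mapper.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split a file across mappers
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    // Reads the entire split (i.e. the entire file) as one value.
    public static class WholeFileRecordReader
            extends RecordReader<NullWritable, BytesWritable> {

        private FileSplit fileSplit;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { /* nothing to close */ }
    }
}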
