Hadoop - How does a Map-Reduce task know which part of a file to process?



I've been starting to learn Hadoop, and currently I'm trying to process log files that are not too well structured, in that the value I normally use for the M/R key is typically found only once, at the top of the file. So basically my mapping function takes that value as the key and then scans the rest of the file to aggregate the values that need to be reduced. So a [fake] log might look like this:

## log.1
SOME-KEY
2012-01-01 10:00:01 100
2012-01-02 08:48:56 250
2012-01-03 11:01:56 212
.... many more rows

## log.2
A-DIFFERENT-KEY
2012-01-01 10:05:01 111
2012-01-02 16:46:20 241
2012-01-03 11:01:56 287
.... many more rows

## log.3
SOME-KEY
2012-02-01 09:54:01 16
2012-02-02 05:53:56 333
2012-02-03 16:53:40 208
.... many more rows

I want to accumulate the 3rd column for each key. I have a cluster of several nodes running this job, and so I was bothered by several issues:

1. File Distribution

Given that hadoop's HDFS works in 64Mb blocks (by default), and every file is distributed over the cluster, can I be sure that the correct key will be matched against the proper numbers? That is, if the block containing the key is in one node, and a block containing data for that same key (a different part of the same log) is on a different machine - how does the M/R framework match the two (if at all)?

2. Block Assignment

For text logs such as the ones described, how is each block's cutoff point decided? Is it after a row ends, or exactly at 64Mb (binary)? Does it even matter? This relates to my #1, where my concern is that the proper values are matched with the correct keys over the entire cluster.

3. File Structure

What is the optimal file structure (if any) for M/R processing? I'd probably be far less worried if a typical log looked like this:

A-DIFFERENT-KEY 2012-01-01 10:05:01 111
SOME-KEY        2012-01-02 16:46:20 241
SOME-KEY        2012-01-03 11:01:56 287
A-DIFFERENT-KEY 2012-02-01 09:54:01 16
A-DIFFERENT-KEY 2012-02-02 05:53:56 333
A-DIFFERENT-KEY 2012-02-03 16:53:40 208
...

However, the logs are huge and it would be very costly (time) to convert them to the above format. Should I be concerned?

4. Job Distribution

Are the jobs assigned such that only a single JobClient handles an entire file? Or rather, how are the keys/values coordinated among all the JobClients? Again, I'm trying to guarantee that my shady log structure still yields correct results.


1 Answer

林空鹿饮溪:


Given that hadoop's HDFS works in 64Mb blocks (by default), and every file is distributed over the cluster, can I be sure that the correct key will be matched against the proper numbers? That is, if the block containing the key is in one node, and a block containing data for that same key (a different part of the same log) is on a different machine - how does the M/R framework match the two (if at all)?

How the keys and the values are mapped depends on the InputFormat class. Hadoop ships with a number of InputFormat classes, and custom InputFormat classes can also be defined.

If a FileInputFormat such as the default TextInputFormat is used, then the key passed to the mapper is the byte offset of the line within the file and the value is the line itself. In most cases the offset is ignored and only the line is processed by the mapper. So, by default, each line in the log file becomes one value handed to the mapper.
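To make that concrete, a per-line mapper plus a summing reducer is enough once every line carries its own key, as in the well-structured layout from point 3 of the question. The sketch below is only illustrative: it assumes the newer org.apache.hadoop.mapreduce API, whitespace-separated fields, and made-up class names (in practice each class would live in its own file).

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: input key = byte offset (ignored), input value = one log line.
// For a line like "SOME-KEY 2012-01-02 16:46:20 241" it emits ("SOME-KEY", 241).
public class LogSumMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().trim().split("\\s+");
        if (fields.length == 4) {                       // key, date, time, value
            long amount = Long.parseLong(fields[3]);    // the column to accumulate
            context.write(new Text(fields[0]), new LongWritable(amount));
        }
    }
}

// Reducer: sums every value seen for a key, no matter which mapper emitted it.
class LogSumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        context.write(key, new LongWritable(sum));
    }
}

The shuffle phase brings all values for a given key to a single reducer regardless of which node's mapper produced them, which is what takes care of point 1 of the question.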

There may be cases where related data in a log file, as in the OP, is split across blocks; each split will then be processed by a different mapper, and Hadoop has no way to relate them. One way around this is to let a single mapper process the complete file, by overriding the FileInputFormat#isSplitable method to return false. This is not an efficient approach if the files are very large.
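A rough sketch of that approach for the layout in the question (a lone key line at the top of each file, followed by "date time value" rows); again the class names are invented and the newer mapreduce API is assumed:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Returning false from isSplitable means each log file becomes a single split,
// so exactly one mapper sees the whole file from the key line onwards.
public class WholeLogInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

// Because one map task now covers one whole file, the mapper can remember the
// header key (e.g. "SOME-KEY") and tag every later data line with it.
class HeaderKeyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private String currentKey = null;

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (line.isEmpty() || line.startsWith("##")) {
            return;                                   // skip blanks and "## log.N" labels, if present
        }
        String[] fields = line.split("\\s+");
        if (fields.length == 1) {
            currentKey = fields[0];                   // the header line, e.g. SOME-KEY
        } else if (currentKey != null && fields.length == 3) {
            long amount = Long.parseLong(fields[2]);  // date, time, value -> keep the value
            context.write(new Text(currentKey), new LongWritable(amount));
        }
    }
}

The driver would register it with job.setInputFormatClass(WholeLogInputFormat.class), and a summing reducer like the one sketched above adds up the per-file totals for each key.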

For text logs such as the ones described, how is each block's cutoff point decided? Is it after a row ends, or exactly at 64Mb (binary)? Does it even matter? This relates to my #1, where my concern is that the proper values are matched with the correct keys over the entire cluster.

By default, each block in HDFS is exactly 64 MB, unless the file is smaller than 64 MB or the default block size has been modified; record boundaries are not considered when blocks are cut, so part of a line can sit in one block and the rest in another. However, the input format does understand record boundaries, so even if a record (line) is split across blocks it will still be processed by a single mapper only. To make that possible, some data may have to be transferred from the node holding the next block.
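The snippet below is not Hadoop source code, just a small self-contained simulation of that boundary rule: the reader of each split skips the partial line at its start (the previous split's reader already consumed it) and is allowed to read past its end to finish the last line.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simulates line-oriented reading over fixed-size "blocks" of a log file.
public class SplitBoundaryDemo {

    static List<String> readSplit(byte[] data, long start, long end) {
        List<String> lines = new ArrayList<>();
        int pos = (int) start;
        if (pos != 0) {
            // The partial first line belongs to the previous split's reader: skip it.
            while (pos < data.length && data[pos - 1] != '\n') pos++;
        }
        // Emit lines that start inside this split; the last one may run past 'end',
        // pulling bytes that physically live in the next block.
        while (pos < data.length && pos < end) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step over the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] log = ("SOME-KEY\n"
                + "2012-01-01 10:00:01 100\n"
                + "2012-01-02 08:48:56 250\n"
                + "2012-01-03 11:01:56 212\n").getBytes(StandardCharsets.UTF_8);
        long splitSize = 40;  // absurdly small "block" so the cuts land mid-line
        for (long start = 0; start < log.length; start += splitSize) {
            System.out.println("split at " + start + ": "
                    + readSplit(log, start, Math.min(start + splitSize, log.length)));
        }
    }
}

Every line ends up in exactly one split's output even though the 40-byte cut points fall in the middle of lines.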

Are the jobs assigned such that only a single JobClient handles an entire file? Or rather, how are the keys/values coordinated among all the JobClients? Again, I'm trying to guarantee that my shady log structure still yields correct results.

It is not entirely clear what is being asked here. I would suggest going through a few MapReduce tutorials and coming back with more specific questions.
