Reading binary files with arbitrary keys in Hadoop

Posted on 2024-12-06 18:00:30

It looks like Hadoop MapReduce requires a key-value pair structure in the text or binary text.
In reality we might have files that need to be split into chunks for processing, but the keys may be
spread across the file. It may not be a clear-cut case of one key followed by one value. Is there any InputFileFormatter that can read this type of binary file? I don't want to run MapReduce after MapReduce; that would slow down performance and defeat the purpose of using MapReduce.
Any suggestions? Thanks.

Comments (1)

心奴独伤 2024-12-13 18:00:30

According to Hadoop: The Definitive Guide:

The logical records that FileInputFormats define do not usually fit neatly into HDFS
blocks. For example, a TextInputFormat’s logical records are lines, which will cross
HDFS boundaries more often than not. This has no bearing on the functioning of your
program—lines are not missed or broken, for example—but it’s worth knowing about,
as it does mean that data-local maps (that is, maps that are running on the same host
as their input data) will perform some remote reads. The slight overhead this causes is
not normally significant.

If the file is split by HDFS at block boundaries, the Hadoop framework will take care of it. But if you split the file manually, then record boundaries have to be taken into consideration.
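For illustration (my sketch, not part of the original answer), this is the pattern a boundary-aware `RecordReader` follows, similar in spirit to what `LineRecordReader` does for lines: skip the partial record at the front of every split except the first, and keep reading past the split end to finish the last record that begins inside the split. The class name and the two private helpers (`findRecordStart`, `readOneRecord`) are hypothetical placeholders for whatever framing your binary format actually uses.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch of a boundary-aware reader for a custom binary format.
public class BinaryRecordReader extends RecordReader<LongWritable, BytesWritable> {
  private FSDataInputStream in;
  private long start, end, pos;
  private final LongWritable key = new LongWritable();
  private final BytesWritable value = new BytesWritable();

  @Override
  public void initialize(InputSplit genericSplit, TaskAttemptContext context)
      throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration conf = context.getConfiguration();
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(conf);
    in = fs.open(file);
    start = split.getStart();
    end = start + split.getLength();
    in.seek(start);
    // Unless this split begins at the start of the file, skip the partial
    // record at the front: the previous split's reader consumes it by
    // reading past its own split end (the same trick LineRecordReader uses).
    if (start != 0) {
      start = findRecordStart(in, start);
    }
    pos = start;
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    // A record is emitted as long as it *begins* before the split end; it may
    // physically extend into the next HDFS block, which just means a short
    // remote read.
    if (pos >= end) {
      return false;
    }
    key.set(pos);
    pos = readOneRecord(in, pos, value);
    return true;
  }

  @Override public LongWritable getCurrentKey() { return key; }
  @Override public BytesWritable getCurrentValue() { return value; }

  @Override
  public float getProgress() {
    return (start == end) ? 0.0f
        : Math.min(1.0f, (pos - start) / (float) (end - start));
  }

  @Override
  public void close() throws IOException {
    if (in != null) {
      in.close();
    }
  }

  // Hypothetical: scan forward for the format's record delimiter / sync marker.
  private long findRecordStart(FSDataInputStream in, long from) throws IOException {
    throw new UnsupportedOperationException("depends on the binary format");
  }

  // Hypothetical: read one complete record starting at 'from', fill 'value',
  // and return the offset just past that record.
  private long readOneRecord(FSDataInputStream in, long from, BytesWritable value)
      throws IOException {
    throw new UnsupportedOperationException("depends on the binary format");
  }
}
```

The key point is that this only works if a reader can locate the start of the next record from an arbitrary byte offset, e.g. via a sync marker or length-prefixed framing.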

In reality we might have files that need to be split into chunks for processing, but the keys may be spread across the file. It may not be a clear-cut case of one key followed by one value.

What's your exact scenario? We can look at a workaround for it.
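One common workaround (my addition, not part of the original answer) when record boundaries genuinely cannot be found from an arbitrary offset: mark the format as non-splittable so each file is read start to finish by a single mapper. You give up split-level parallelism within a file, but you never have to guess where a record begins. The `BinaryRecordReader` referenced below is the sketch above.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Whole-file input format: each file becomes exactly one split, so the reader
// never has to locate a record boundary in the middle of the data.
public class WholeBinaryFileInputFormat
    extends FileInputFormat<LongWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path filename) {
    return false; // never split: one mapper per file
  }

  @Override
  public RecordReader<LongWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // With isSplitable == false, the "skip partial first record" logic in the
    // reader simply never triggers.
    return new BinaryRecordReader();
  }
}
```

This works best when the input consists of many moderately sized files; for a single very large file it serializes the whole read onto one mapper.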
