Is "adopting the MapReduce model" the universal answer to scalability?
I have been trying to understand the MapReduce concept and apply it to my current situation. What is my situation? Well, I have an ETL tool here, in which data transformation happens outside of the source and destination data sources (databases). Hence, the source data source is used purely for extract and the destination purely for load.
So, this act of transformation today takes, say, about X hours for a million records. I would like to address a scenario where I would have a billion records but would want the work done in the same X hours. So here is the need: for my product to scale out (by adding more commodity machines) based on the scale of the data. As you can see, I am only worried about the ability to distribute my product's transformation functionality to different machines, thereby leveraging CPU power from all these machines.
I started looking for options and came across Apache Hadoop, and eventually the concept of MapReduce. I was pretty successful in setting up Hadoop quickly without running into issues in cluster mode, and was happy to run a wordcount demo too. Soon, I realized that to implement my own MapReduce model, I would have to redefine my product's transformation functionality into MAP and REDUCE functions.
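(For what it's worth, for a purely row-by-row transformation the map side alone seemed like it might be enough. Below is a minimal sketch of what I mean by a map-only job; the transform method is just a placeholder for my product's real logic, not actual code from it.)

```java
import java.io.IOException;
import java.util.Locale;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TransformOnlyJob {

  /** Placeholder for the product's real per-record transformation logic. */
  static String transform(String record) {
    return record.trim().toUpperCase(Locale.ROOT); // stand-in transform
  }

  public static class TransformMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // One extracted source record per input line; transform and emit it.
      context.write(NullWritable.get(), new Text(transform(line.toString())));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(TransformOnlyJob.class);
    job.setMapperClass(TransformMapper.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    // A row-by-row transformation needs no grouping, hence no reduce phase.
    job.setNumReduceTasks(0);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```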
Here's when trouble began. I read a copy of Hadoop: The Definitive Guide, and I understood that many of the common use cases of Hadoop are in scenarios where one is faced with:
- Unstructured data on which one would like to perform aggregation, sorting, or something of that kind.
- Unstructured text that needs to be mined.
- etc.
Here is my scenario, where I extract from a database and load to a database (i.e., structured data), and my sole purpose is to bring more CPUs into play, in a reliable manner, and thereby distribute my transformation. And redefining my transformation to fit a Map and Reduce model is a huge challenge in itself. So here are my questions:
- Have you used Hadoop in ETL scenarios? If yes, could you be specific about how you handled MapReducing your transformation? Have you used Hadoop purely for leveraging extra CPU power?
- Is the MapReduce concept the universal answer to distributed computing? Are there other equally good options?
- My understanding is that MapReduce applies to large datasets for sorting/analytics/grouping/counting/aggregation/etc. Is my understanding correct?
3 Answers
If you want to scale out a processing problem over a lot of systems, you must do two things:
- Split the processing up into parts.
- Make those parts as independent of one another as possible.

If there are dependencies, then these will be the limit on your horizontal scalability.
So if you are starting from a relational model, then the main obstruction is the fact that you have relationships. Having these relationships is a great asset in relational databases, but is a pain in the ... when trying to scale out.
The simplest way to go from relational to independent parts is to make a jump and de-normalize your data into records that have everything in them and are focused around the part you want to do the processing on. Then you can distribute them over a huge cluster, and after the processing has been completed you use the results.
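As a sketch of what such a de-normalized record can look like (the order/customer field names here are made up; the point is the shape, not the schema):

```java
/** Made-up order/customer fields; the point is the shape, not the schema. */
public class Denormalize {

  // One tab-separated line per order that carries everything the transform
  // needs, instead of a customerId reference that would force a join later.
  static String denormalize(String orderId, long amountCents,
                            String customerName, String country) {
    return String.join("\t", orderId, Long.toString(amountCents),
                       customerName, country);
  }

  public static void main(String[] args) {
    // Such a line can be shipped to any node and processed with no lookups.
    System.out.println(denormalize("o-42", 1999, "Ada Lovelace", "UK"));
  }
}
```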
If you cannot do such a jump you're in trouble.
So coming back to your questions:
# Have you used Hadoop in ETL scenarios?
Yes, the input being Apache logfiles, and the loading and transformation consisted of parsing, normalizing and filtering these loglines. The result wasn't put in a normal RDBMS!
# Is MapReduce concept the universal answer to distributed computing? Are there other equally good options?
MapReduce is a very simple processing model that works great for any processing problem you are able to split into a lot of smaller, 100% independent parts. The MapReduce model is so simple that, as far as I know, any problem that can be split into independent parts can be written as a series of MapReduce steps.
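As a sketch of such a series, here is the bare wiring of two chained jobs. The steps below just run Hadoop's identity defaults; a real chain would plug its own Mapper/Reducer into each step:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStepDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]); // output of step 1, input of step 2
    Path output = new Path(args[2]);

    // Step 1. With no explicit Mapper/Reducer set, Hadoop's identity
    // defaults are used; a real step would set its own classes here.
    Job step1 = Job.getInstance(conf, "step 1");
    step1.setJarByClass(TwoStepDriver.class);
    FileInputFormat.addInputPath(step1, input);
    FileOutputFormat.setOutputPath(step1, intermediate);
    if (!step1.waitForCompletion(true)) System.exit(1);

    // Step 2 simply consumes what step 1 produced.
    Job step2 = Job.getInstance(conf, "step 2");
    step2.setJarByClass(TwoStepDriver.class);
    FileInputFormat.addInputPath(step2, intermediate);
    FileOutputFormat.setOutputPath(step2, output);
    System.exit(step2.waitForCompletion(true) ? 0 : 1);
  }
}
```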
HOWEVER: it is important to note that at this moment only BATCH-oriented processing can be done with Hadoop. If you want "realtime" processing, you are currently out of luck.
At this moment I don't know of a better model for which an actual implementation exists.
# My understanding is that MapReduce applies to large datasets for sorting/analytics/grouping/counting/aggregation/etc. Is my understanding correct?
Yep, that is the most common application.
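For example, a grouping/aggregation step over structured records maps very naturally onto the model. The sketch below sums an amount per country, reusing the made-up tab-separated layout from the de-normalization example above; the driver wiring would be the same as in the earlier sketches:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Sums amountCents per country over the made-up de-normalized records. */
public class CountrySum {

  public static class SumMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Fields: 0 = orderId, 1 = amountCents, 2 = customerName, 3 = country.
      String[] fields = line.toString().split("\t");
      ctx.write(new Text(fields[3]),
                new LongWritable(Long.parseLong(fields[1])));
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text country, Iterable<LongWritable> amounts,
                          Context ctx) throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable amount : amounts) {
        total += amount.get();
      }
      ctx.write(country, new LongWritable(total)); // one total per country
    }
  }
}
```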
MapReduce is "one" solution for "some" class of problems. It does not solve all the distributed systems problems - think about large TPS systems as the ones in banks or telecoms or telco signaling - there MR might be ineffective. But for the non real-time data processing MR performs awesome and you might consider it for massive ETL.
I cannot answer #1, as I haven't used MapReduce in ETL scenarios. However, I can say that MapReduce is not a "universal answer" for distributed computing; it's a useful tool for handling certain types of situations where data is structured in a certain way. Think of it like a hashtable: very useful for certain situations, but not an "ultimate algorithm" by any definition of the term.
My personal understanding is that MapReduce is particularly useful for large quantities of "unstructured" data; that is, it's useful for imposing some structure (basically, effectively providing a "first-order" operation on large unstructured datasets). However, for datasets that are very large and relatively "tightly bound" (i.e., with strong associations between disparate data elements), it's (in my understanding) not a great solution.