Hadoop and MySQL Integration
We would like to implement Hadoop on our system to improve its performance.
The process works like this:
Hadoop will gather data from the MySQL database and then process it.
The output will then be exported back to the MySQL database.
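Concretely, we have in mind something like the rough sketch below, using Hadoop's built-in DBInputFormat/DBOutputFormat. All table names, column names, and connection settings here are made up for illustration:

    // Rough sketch of a map-only job that reads rows from MySQL, transforms
    // them, and writes the results back to MySQL. Names are hypothetical.
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;

    public class MySqlRoundTrip {

      // One row; DBInputFormat fills it from a ResultSet and DBOutputFormat
      // writes it back through a PreparedStatement.
      public static class OrderRecord implements Writable, DBWritable {
        long id;
        double amount;

        public void readFields(ResultSet rs) throws SQLException {
          id = rs.getLong(1);
          amount = rs.getDouble(2);
        }
        public void write(PreparedStatement ps) throws SQLException {
          ps.setLong(1, id);
          ps.setDouble(2, amount);
        }
        public void readFields(DataInput in) throws IOException {
          id = in.readLong();
          amount = in.readDouble();
        }
        public void write(DataOutput out) throws IOException {
          out.writeLong(id);
          out.writeDouble(amount);
        }
      }

      public static class TransformMapper
          extends Mapper<LongWritable, OrderRecord, OrderRecord, NullWritable> {
        protected void map(LongWritable key, OrderRecord row, Context ctx)
            throws IOException, InterruptedException {
          row.amount *= 1.1; // placeholder for the real processing
          ctx.write(row, NullWritable.get());
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
            "jdbc:mysql://dbhost/mydb", "user", "pass");

        Job job = Job.getInstance(conf, "mysql-round-trip");
        job.setJarByClass(MySqlRoundTrip.class);
        job.setMapperClass(TransformMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(OrderRecord.class);
        job.setOutputValueClass(NullWritable.class);

        // SELECT id, amount FROM orders ORDER BY id
        DBInputFormat.setInput(job, OrderRecord.class,
            "orders", null, "id", "id", "amount");
        // INSERT INTO orders_processed (id, amount) VALUES (?, ?)
        DBOutputFormat.setOutput(job, "orders_processed", "id", "amount");

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }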
Is this a good implementation? Will this improve our system's overall performance?
What are the requirements and has this been done before? A good tutorial would really help.
Thanks
Comments (5)
Sqoop is a tool designed to import data from relational databases into Hadoop: https://github.com/cloudera/sqoop/wiki/
There is also a video about it: http://www.cloudera.com/blog/2009/12/hadoop-world-sqoop-database-import-for-hadoop/
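For illustration, a typical import/export pair might look like this (hypothetical host, database, and table names; see the wiki above for the full set of flags):

    # Import a MySQL table into HDFS
    sqoop import --connect jdbc:mysql://dbhost/mydb \
        --username user --password pass \
        --table orders --target-dir /data/orders

    # Export processed results from HDFS back into MySQL
    sqoop export --connect jdbc:mysql://dbhost/mydb \
        --username user --password pass \
        --table orders_processed --export-dir /data/orders_out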
Hadoop 用于基于批处理的作业,主要针对大型半结构化数据。批处理从某种意义上说,即使是最短的作业也只有几分钟的量级。您面临什么样的性能问题?是基于数据转换还是报告。根据情况,这种架构可能会有所帮助,也可能会使事情变得更糟。
Hadoop is used for batch-based jobs, mostly on large semi-structured data. Batch in the sense that even the shortest jobs run on the order of minutes. What kind of performance problem are you facing? Is it based on data transformations or on reporting? Depending on that, this architecture may help or may make things worse.
As mentioned by Joe, Sqoop is a great tool of the Hadoop ecosystem for importing and exporting data from and to SQL databases such as MySQL.
If you need more complex integration with MySQL, including e.g. filtering or transformation, then you should use an integration framework or integration suite for this problem. Take a look at my presentation "Big Data beyond Hadoop - How to integrate ALL your data" for more information about how to use open source integration frameworks and integration suites with Hadoop.
Although it is not a regular Hadoop usage, it might make sense in the following scenario:
a) You have a good way to partition your data into inputs (such as existing partitioning).
b) The processing of each partition is relatively heavy. As a rough number, I would say at least 10 seconds of CPU time per partition.
If both conditions are met, you will be able to apply any desired amount of CPU power to your data processing.
If you are doing a simple scan or aggregation, I think you will not gain anything. On the other hand, if you are going to run some CPU-intensive algorithm on each partition, then your gain can indeed be significant.
I would also mention a separate case: when your processing requires massive data sorting. I do not think MySQL will be good at sorting billions of records, but Hadoop will do it.
I agree with Sai. I use Hadoop with MySQL only when needed. I export the table to CSV and upload it to HDFS to process the data more quickly. If you want to persist the processed data, you will have to write a single-reducer job that does some kind of batch inserts to improve insertion performance.
BUT that really depends on what kind of things you want to do.
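For what it's worth, the reduce side of such a single-reducer job could look roughly like this, using plain JDBC batching (the connection string and target table are hypothetical):

    // Reducer that buffers rows and flushes them to MySQL in JDBC batches.
    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class BatchInsertReducer
        extends Reducer<Text, Text, NullWritable, NullWritable> {
      private Connection conn;
      private PreparedStatement stmt;
      private int pending = 0;
      private static final int BATCH_SIZE = 1000;

      protected void setup(Context ctx) throws IOException {
        try {
          // Hypothetical connection settings and target table.
          conn = DriverManager.getConnection(
              "jdbc:mysql://dbhost/mydb", "user", "pass");
          conn.setAutoCommit(false);
          stmt = conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)");
        } catch (SQLException e) {
          throw new IOException(e);
        }
      }

      protected void reduce(Text key, Iterable<Text> values, Context ctx)
          throws IOException {
        try {
          for (Text v : values) {
            stmt.setString(1, key.toString());
            stmt.setString(2, v.toString());
            stmt.addBatch();
            if (++pending >= BATCH_SIZE) { // flush a full batch
              stmt.executeBatch();
              conn.commit();
              pending = 0;
            }
          }
        } catch (SQLException e) {
          throw new IOException(e);
        }
      }

      protected void cleanup(Context ctx) throws IOException {
        try {
          if (pending > 0) { stmt.executeBatch(); conn.commit(); }
          stmt.close();
          conn.close();
        } catch (SQLException e) {
          throw new IOException(e);
        }
      }
    }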