I'm looking into replacing a bunch of Python ETL scripts that perform a nightly / hourly data summary and statistics gathering on a massive amount of data.
What I'd like to achieve:
- Robustness - a failing job / step should be automatically restarted. In some cases I'd like to execute a recovery step instead.
- The framework must be able to recover from crashes. I guess some persistence would be needed here.
- Monitoring - I need to be able to monitor the progress of jobs / steps, and preferably see history and statistics with regards to the performance.
- Traceability - I must be able to understand the state of the executions.
- Manual intervention - nice to have... being able to start / stop / pause a job from an API / UI / command line.
- Simplicity - I prefer not to get angry looks from my colleagues when I introduce the replacement... Having a simple and easy to understand API is a requirement.
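To make the robustness and crash-recovery requirements concrete, here is a minimal Python sketch, not taken from any of the frameworks discussed. It retries a failing step, falls back to an optional recovery step, and persists each step's status to a JSON file so a restarted process can skip work that already completed. The names `JobState` and `run_step`, and the state-file layout, are hypothetical.

```python
import json
import os
import tempfile
import time


class JobState:
    """Persists per-step status to a JSON file so a restarted process can resume."""

    def __init__(self, path):
        self.path = path
        self.steps = {}
        if os.path.exists(path):
            with open(path) as f:
                self.steps = json.load(f)

    def mark(self, step, status):
        self.steps[step] = status
        # Write to a temp file, then rename it over the old one, so the
        # state file is replaced in one step rather than edited in place.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.steps, f)
        os.replace(tmp, self.path)


def run_step(state, name, action, retries=3, delay=0.0, recovery=None):
    """Run `action` with retries; skip it if an earlier run already finished it."""
    if state.steps.get(name) == "done":
        return "skipped"
    for attempt in range(1, retries + 1):
        state.mark(name, "running (attempt %d)" % attempt)
        try:
            action()
        except Exception:
            time.sleep(delay)  # back off before the next attempt
        else:
            state.mark(name, "done")
            return "done"
    # All attempts failed: run the recovery step if one was supplied.
    if recovery is not None:
        recovery()
        state.mark(name, "recovered")
        return "recovered"
    state.mark(name, "failed")
    return "failed"
```

A framework such as spring-batch keeps this kind of execution state in a job repository; the JSON file here only stands in for that idea.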
The current scripts do the following:
- Collect text logs from many machines, and push them into Hadoop DFS. We may use Flume for this step in the future (see http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/).
- Perform Hive summary queries on the data, and insert (overwrite) to new Hive tables / partitions.
- Extract the new summary data into files, and load (merge) it into MySQL tables. This data is needed later for on-line reports.
- Perform additional joins on the newly added MySQL data (from MySQL tables), and update the data.
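The four steps above can be sketched as an ordered pipeline that records a history row for every step it runs, which is roughly what the monitoring and traceability requirements ask for. This is a sketch under stated assumptions: the step functions are empty placeholders (the real ones would call out to Hadoop, Hive and MySQL), and the `step_runs` table and the `run_pipeline` name are hypothetical.

```python
import sqlite3
import time


def run_pipeline(steps, db_path=":memory:"):
    """Run (name, callable) steps in order and stop at the first failure.

    Each executed step is logged to a `step_runs` table (hypothetical schema)
    with its start time, duration in seconds, and status.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS step_runs ("
        "step TEXT, started REAL, seconds REAL, status TEXT)"
    )
    for name, action in steps:
        started = time.time()
        try:
            action()
            status = "ok"
        except Exception as exc:
            status = "failed: %s" % exc
        conn.execute(
            "INSERT INTO step_runs VALUES (?, ?, ?, ?)",
            (name, started, time.time() - started, status),
        )
        conn.commit()
        if status != "ok":
            break  # later steps depend on this one, so do not continue
    return conn


# Placeholder step bodies, one per step in the list above.
def collect_logs():
    pass


def hive_summaries():
    pass


def load_into_mysql():
    pass


def mysql_joins():
    pass


PIPELINE = [
    ("collect_logs", collect_logs),
    ("hive_summaries", hive_summaries),
    ("load_into_mysql", load_into_mysql),
    ("mysql_joins", mysql_joins),
]
```

Calling `run_pipeline(PIPELINE, "history.db")` would append one row per executed step to `history.db`, which can then be queried for past durations and failures.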
My idea is to replace the scripts with spring-batch. I also looked into Scriptella, but I believe it is too 'simple' for this case.
Since I saw some bad vibes about Spring-Batch (mostly in old posts), I'm hoping to get some input here. I also haven't seen much about spring-batch and Hive integration, which is troubling.
Comments (3)
If you want to stay within the Hadoop ecosystem, I'd highly recommend checking out Oozie to automate your workflow. We (Cloudera) provide a packaged version of Oozie that you can use to get started. See our recent blog post for more details.
Why not use JasperETL or Talend? Seems like the right tool for the job.
I've used Cascading quite a bit and found it to be quite impressive:
Cascading
It is an M/R (MapReduce) abstraction layer that runs on Hadoop.