How do I use Hive on Amazon Elastic MapReduce to process data in Amazon SimpleDB?
I have a lot of data in an Amazon SimpleDB domain. I want to start Hive on Elastic MapReduce (on top of Hadoop) and somehow either import data from SimpleDB or connect to SimpleDB and run HiveQL queries on it. I am having issues importing the data. Any pointers?
Comments (1)
As input to a streaming Hadoop job, you could have a sequence of select statements for SimpleDB.
For example, your input could contain (in a less verbose form) one select statement per line.
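A hypothetical sketch of such an input file, assuming a domain named mydomain with a zero-padded id attribute used to slice the key space across map tasks:

    select * from mydomain where id between '00000000' and '00009999'
    select * from mydomain where id between '00010000' and '00019999'
    select * from mydomain where id between '00020000' and '00029999'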
Then you would implement a mapper script that performed the following transformation:
input_select_statement => execute_select_statement => output_results
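A minimal mapper sketch in Python (2.x) using the boto library; the domain name mydomain and the tab-separated output format are assumptions, and AWS credentials are expected in the environment:

    #!/usr/bin/env python
    # Hadoop Streaming mapper: each input line is one SimpleDB select
    # statement; run it and emit every matching item to stdout.
    import sys
    import boto

    conn = boto.connect_sdb()  # picks up AWS keys from the environment
    domain = conn.get_domain('mydomain', validate=False)  # hypothetical domain

    for line in sys.stdin:
        query = line.strip()
        if not query:
            continue
        # domain.select follows NextToken pagination for us
        for item in domain.select(query):
            # emit the item name as the key, the attribute dict as the value
            print '%s\t%s' % (item.name, dict(item))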
This would be super easy using streaming because you could use any library in any language you like, and not have to worry about implementing any of the complicated Hadoop Java stuff.
Hope this helps.
(The hacky way to do it would be to have a single script that you run locally that does the same as above, but loads the results into S3. I run a script like that nightly for a lot of our database data.)
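A sketch of that local variant, again with boto; the bucket and key names here are made up:

    import boto

    sdb = boto.connect_sdb()
    domain = sdb.get_domain('mydomain', validate=False)

    # run the select locally and collect the rows
    rows = ['%s\t%s' % (item.name, dict(item))
            for item in domain.select('select * from mydomain')]

    # push the result up to S3 as a single file
    s3 = boto.connect_s3()
    bucket = s3.get_bucket('my-results-bucket')
    key = bucket.new_key('simpledb-dumps/mydomain.tsv')
    key.set_contents_from_string('\n'.join(rows))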