What are the benefits of integrating HBase and Hive?
Recently, I came across a blog post whose author mentioned integrating HBase and Hive. Is this possible, and if so, what are the advantages of using both together (in terms of performance and scalability)? Please correct me if I have misunderstood anything.
Comments (2)
I think it is possible, but it will not be trivial to set up for a while. Maybe CDH3 final will include the integration when it comes out.
Advantages: Hive queries over HBase. Think joins, and an easy way to do aggregates and simple operations on your HBase data.
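To make "Hive queries over HBase" a bit more concrete, here is a minimal sketch in Python that only builds the Hive statement text for mapping an existing HBase table into Hive through the HBase storage handler. The table names (`page_views`, `views`), the column family `stats`, and the helper `hbase_table_ddl` are hypothetical, and the exact syntax your Hive build accepts may differ, so treat this as an illustration rather than a tested setup.

```python
def hbase_table_ddl(hive_table, hbase_table, columns):
    """Build a CREATE EXTERNAL TABLE statement for an existing HBase table.

    columns is a list of (hive_name, hive_type, hbase_mapping) tuples.
    The HBase row key is mapped with the special name ":key"; other
    columns are mapped as "family:qualifier".
    """
    col_defs = ", ".join(f"{name} {hive_type}" for name, hive_type, _ in columns)
    mapping = ",".join(hbase_col for _, _, hbase_col in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({col_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')"
    )


# Hypothetical example: an HBase table "views" keyed by URL, with a hit
# counter stored under the column family "stats".
ddl = hbase_table_ddl(
    "page_views",
    "views",
    [("url", "string", ":key"), ("hits", "int", "stats:hits")],
)

# Once the table is mapped, an aggregate is ordinary HiveQL.
top_pages = "SELECT url, SUM(hits) AS total FROM page_views GROUP BY url"

print(ddl)
print(top_pages)
```

The point of the sketch is that, after the mapping exists, joins and aggregates are written as normal Hive queries while the data itself stays in HBase.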
Why not just use Hive and not bother with HBase? HBase gives you a scalable storage infrastructure that keeps data online. StumbleUpon uses HBase for their live website. Hive is not a real-time query engine, so its data store could not be used for similar purposes. Hive over HBase gives you the best of both worlds.
There is currently a patch that enables loading data between HBase and Hive. You can find it here:
http://wiki.apache.org/hadoop/Hive/HBaseIntegration
The implementation overhead looks to be pretty high.
It might be easier to run a scan on the HBase table, save the output to an external file, and then import that file into Hive for data manipulation. (This is also pretty cumbersome, but it can be scripted if you do it on a regular basis.) This is the solution I am currently working on. I'll let you know how it goes.
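The export step of that workaround could be scripted roughly as follows. This is only a sketch under assumptions: it takes rows that some HBase client has already scanned, represented here as hypothetical `(row_key, {column: value})` pairs, and writes them as tab-separated text, a layout that a Hive table declared with tab-delimited fields can load. The helper name `rows_to_tsv` and the column names are made up for illustration; the scan itself is not shown.

```python
import io

# Hive's default text format reads the two characters \N as NULL.
HIVE_NULL = "\\N"


def rows_to_tsv(rows, columns, out):
    """Write scanned rows as tab-separated lines and return the row count.

    rows:    iterable of (row_key, {column_name: value}) pairs
    columns: HBase column names to export, in the order Hive expects them
    out:     any writable text stream (a file or io.StringIO)
    """
    count = 0
    for row_key, cells in rows:
        fields = [row_key] + [cells.get(col, HIVE_NULL) for col in columns]
        # A tab or newline inside a value would break the layout,
        # so replace them with spaces before writing.
        clean = [str(f).replace("\t", " ").replace("\n", " ") for f in fields]
        out.write("\t".join(clean) + "\n")
        count += 1
    return count


# Hypothetical scan result: the second row has no value for "cf:b".
scanned = [
    ("row1", {"cf:a": "1", "cf:b": "x"}),
    ("row2", {"cf:a": "2"}),
]

buf = io.StringIO()
written = rows_to_tsv(scanned, ["cf:a", "cf:b"], buf)
print(written, "rows exported")
```

On the Hive side, a file like this would typically be brought in with a `LOAD DATA LOCAL INPATH ... INTO TABLE ...` statement against a table created with `ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'`. The export has to be rerun whenever fresh data is needed, which is the cumbersome part mentioned above.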
As for why you would choose HBase over Hive, they aren't really interchangeable. HBase is meant as a highly scalable data store built on top of Hadoop, with little support for data analysis. Hive, on the other hand, isn't used for storing data in a production environment; rather, it makes it very easy to run specific queries over large amounts of data.