How to store and query very large data sets (beyond relational databases)
We are currently facing the problem of how to effectively store and retrieve data from very large data sets (into the billions of rows). We have been using MySQL and have optimized the system, OS, RAID, queries, indexes, etc., and are now looking to move on.
I need to make an informed decision about what technology to pursue to solve our data problems. I have been investigating map/reduce with HDFS, but also have heard good things about HBase. I can't help but think there are other options as well. Is there a good comparison of the technologies available and what the trade-offs of each are?
If you have links to share on each, I would appreciate that as well.
This is a broad issue. I will try to give some directions, and for each one you can look up or ask for further information.
The first option is conventional databases. If the data is valuable enough to justify RAID arrays and good servers, Oracle might be a good, but expensive, solution. TPC-H is an industry-standard benchmark for decision support queries: http://www.tpc.org/tpch/results/tpch_perf_results.asp (the link points to the top performance results). As you can see there, an RDBMS can scale to terabytes of data.
The second is Hadoop, in the form of HDFS + Map/Reduce + Hive. Hive is a data warehousing solution on top of MapReduce. You get some additional benefits, such as the ability to store data in its original format and to scale linearly. One thing to watch out for is indexing and running very complex queries.
The third is MPP (massively parallel processing) databases. They scale from dozens to hundreds of nodes and have rich SQL support. Examples are Netezza, Greenplum, Aster Data, and Vertica. Choosing among them is not a simple task, but with more precise requirements it can be done.
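To make the Hive point a bit more concrete: a query such as SELECT user, SUM(bytes) FROM logs GROUP BY user is, roughly speaking, executed as a map step, a shuffle that groups values by key, and a reduce step. The following is only a single-process Python sketch of that model, not the Hadoop or Hive API. The table contents and all function names (map_phase, shuffle, reduce_phase, run_job) are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a "logs" table.
ROWS = [
    {"user": "alice", "bytes": 120},
    {"user": "bob", "bytes": 300},
    {"user": "alice", "bytes": 80},
]


def map_phase(records, mapper):
    """Apply the mapper to every record and yield its (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def bytes_mapper(row):
    # Emit one (user, bytes) pair per row: the GROUP BY key and the summed column.
    yield row["user"], row["bytes"]


def sum_reducer(key, values):
    # SUM(bytes) for one user.
    return sum(values)


def run_job(rows):
    """Run map, shuffle, and reduce in sequence over the given rows."""
    return reduce_phase(shuffle(map_phase(rows, bytes_mapper)), sum_reducer)


if __name__ == "__main__":
    print(run_job(ROWS))
```

On a real cluster the map and reduce steps run in parallel on many nodes over data stored in HDFS, which is where the linear scaling comes from. Note that every query in this model scans its whole input, which is one reason indexing deserves a careful look.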