NoSQL or MySQL for data analytics
We have a cluster (Hadoop, Pig) that churns through 350 GB of data (growing by a couple of GB a week).
All of this data needs to be made available for analytics.
We have a MySQL solution with a star schema (only part of the data is loaded into it), but
the concern is how far one can stretch this.
Should I be looking at NoSQL like Hive for data analytics?
I read this article: http://anders.com/cms/282/Distributed.Data/Hadoop/Hbase/Hive
How big is Big Data, and when should I be looking away from MySQL?
Will the structural rigidity of MySQL cause problems?
Currently the data is only a few GB (in MySQL), but it will certainly grow.
How about MySQL clustering?
Should I be going down this path at all?
Comments (4)
Do you have MySQL gurus in house? If yes, sure => just create and grow that MySQL cluster. The problem with this solution is not that it is MySQL, and not that it is not NoSQL => it is simply that it requires an expert to set it up and to be on hand whenever it needs to be changed. But guess what => SQL is MUCH better and simpler for analytics than a map/reduce-ish simulation of SQL.
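To make the "SQL is simpler for analytics" point concrete, here is a minimal sketch that computes the same per-group total twice: once with a single `GROUP BY`, and once with hand-written map/shuffle/reduce steps. The `sales` table, its columns, and the sample rows are invented for illustration, and SQLite stands in for MySQL only so that the snippet runs without a server.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, amount). Not from the original question.
ROWS = [("north", 10.0), ("south", 5.0), ("north", 2.5), ("east", 7.0)]


def totals_sql(rows):
    """Aggregate with one declarative SQL statement."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    )
    result = dict(cursor.fetchall())
    conn.close()
    return result


def totals_mapreduce(rows):
    """The same aggregation spelled out as map, shuffle and reduce steps."""
    # Map: emit one (key, value) pair per input row.
    mapped = [(region, amount) for region, amount in rows]
    # Shuffle: collect all values that share a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: collapse each group to a single total.
    return {key: sum(values) for key, values in groups.items()}


if __name__ == "__main__":
    print(totals_sql(ROWS))
    print(totals_mapreduce(ROWS))
```

Both functions return the same totals; the difference is how much plumbing has to be written and maintained by hand once joins, filters and multiple aggregates enter the picture.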
Something that can become a problem later with a MySQL solution is Oracle. So make sure you understand which features of MySQL you can use for free, and which features you would have to pay for.
If you do not have a MySQL expert in house, or you would not like to pay for one, you can definitely turn to NoSQL. That does not mean you would not need expertise in the NoSQL product, but configuring and running X nodes as a single system is an extremely simple and natural process for NoSQL solutions.
For example, in Riak, and a couple of other NoSQL beasts, most of the distribution complexities are solved by the product without you needing to do anything at all => it really is that simple.
The price you pay with NoSQL is losing SQL (think about the nice aggregation features) and strong consistency, since consistency is only eventual. If you are strictly doing analytics, consistency may not be a price for you at all.
In return you get very natural Big Data handling, fault tolerance and much more.
If you are in the Hadooooxyz space, and you are okay with paying, take a look at Hadapt, which promises 5 times Hive's performance.
The question is of course now many months old, but... I recently came across InfiniDB, which puts a MySQL front end on a highly scalable, MapReduce-based Big Data engine aimed specifically at analytics. It may be a solution for this problem: in principle it should drop in and require very little administration and few code changes. Both scaling up on one box and scaling out across multiple servers are supported...
You switch when you start having the kinds of problems outlined in something like this comparative question: https://dba.stackexchange.com/questions/5/what-are-the-differences-between-nosql-and-a-traditional-rdbms
Other than that, it's a little difficult to answer the question beyond general advice, because you don't pose a specific problem that you are trying to solve (e.g. scaling, read speed, the problems with requiring 100% consistency, etc.).
InfiniDB is not free.
Check out http://code.google.com/p/shard-query
This is like Map-Reduce over a sharded, shared-nothing set of databases. It works great for star schemas: shard the fact table over N nodes and duplicate the dimension tables on each server.
You can check out this blog post for more info and performance testing results:
http://www.mysqlperformanceblog.com/2011/05/06/scale-out-mysql/
FYI: I'm the author of Shard-Query.
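The layout described above (fact table split across N nodes, dimension tables copied to every node) can be sketched roughly as follows. This is not Shard-Query's code or API; it is a toy illustration under stated assumptions: in-memory SQLite databases stand in for separate MySQL servers, and the table names, columns, shard count, and modulo sharding rule are all hypothetical.

```python
import sqlite3

N_SHARDS = 3  # hypothetical node count

# Hypothetical star schema data.
DIM_PRODUCT = [(1, "widget"), (2, "gadget")]  # (product_id, name)
FACT_SALES = [  # (sale_id, product_id, qty)
    (1, 1, 10), (2, 2, 4), (3, 1, 5), (4, 2, 6), (5, 1, 1),
]


def make_shards(n):
    """Create n independent databases, each holding a full copy of the
    dimension table and a slice of the fact table."""
    shards = []
    for _ in range(n):
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)"
        )
        conn.execute(
            "CREATE TABLE fact_sales (sale_id INTEGER, product_id INTEGER, qty INTEGER)"
        )
        # Dimension rows are duplicated on every shard so joins stay local.
        conn.executemany("INSERT INTO dim_product VALUES (?, ?)", DIM_PRODUCT)
        shards.append(conn)
    # Fact rows are distributed by a simple modulo on the sale id.
    for row in FACT_SALES:
        shards[row[0] % n].execute("INSERT INTO fact_sales VALUES (?, ?, ?)", row)
    return shards


def total_qty_by_product(shards):
    """Run the same aggregate on every shard, then merge the partial sums."""
    query = (
        "SELECT d.name, SUM(f.qty) "
        "FROM fact_sales f JOIN dim_product d ON d.product_id = f.product_id "
        "GROUP BY d.name"
    )
    totals = {}
    for conn in shards:
        for name, partial in conn.execute(query):
            totals[name] = totals.get(name, 0) + partial
    return totals


if __name__ == "__main__":
    print(total_qty_by_product(make_shards(N_SHARDS)))
```

Merging by addition works here because `SUM` can be recombined from partial results; an aggregate such as `AVG` would need each shard to return a sum and a count instead. In a real deployment the per-shard queries would also run in parallel rather than in a loop.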