Netezza, Teradata, DB2 Parallel/Enterprise ... compared with Hadoop or others?

Posted 2024-08-18 05:58:26

I'm looking at building some data warehousing/querying infrastructure, right now on top of Map/Reduce solutions like Hadoop.

However, it strikes me that all the M/R work is just repeating what the RDBMS guys have been solving for the last 20 years with parallel SQL databases. Parallel SQL implementations scale reads and writes across nodes, just like M/R, but they also already contain the niceties of regular databases (SQL, existing integration libraries, etc.).

The problem is: you don't seem to find the customers of those companies posting much online. So, does anyone here have experience with those kinds of solutions, and can give me some insight and/or links?

Comments (3)

黑色毁心梦 2024-08-25 05:58:26

I have used Netezza and Hadoop, and I have second-hand knowledge of Infobright, a column database.

Netezza is a true database and implements ACID properties, which has both a cost and a benefit. Netezza is moving toward allowing more M/R code to run on its table data with the new TwinFin architecture. Previous versions of the appliance supported user-defined functions and aggregates. The new version, which runs Linux on the SPUs and uses Intel processors, opens the door to running more custom code close to the data. My experience with Netezza has been very positive - both the technology and the company.

Hadoop is pure map-reduce computing. It doesn't incur the cost of ACID database properties, so it's really a different beast than Netezza. Depending on the usage pattern, it may be better than Netezza, and it is certainly cheaper. Hadoop also supports HBase and Hive, which may give you the query convenience you need at a lower cost.
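To make the "pure map-reduce" point concrete, here is a minimal single-process sketch in Python of the map, shuffle, and reduce phases for a sum-by-key aggregation. It does not use Hadoop's API; the sales records, the partition layout, and all function names are invented for illustration. A SQL layer such as Hive would let you express the same thing roughly as `SELECT region, SUM(amount) FROM sales GROUP BY region`.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical sales records (region, amount), split into partitions
# the way input splits would be spread across nodes. Invented data.
PARTITIONS = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 3)],
    [("west", 8)],
]


def map_phase(partition):
    """Emit (key, value) pairs from one partition; each partition is independent."""
    return [(region, amount) for region, amount in partition]


def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def total_by_region(partitions):
    mapped = chain.from_iterable(map_phase(p) for p in partitions)
    return reduce_phase(shuffle(mapped))


print(total_by_region(PARTITIONS))  # {'east': 17, 'west': 13, 'north': 3}
```

In a real cluster the map calls run on different nodes and the shuffle moves data over the network; the sketch only shows the shape of the computation.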

Another developer on our team evaluated Infobright (so this is second hand) and found the load performance to be poor and some of the aggregations to be slow. It has some parallels with Netezza (e.g. Netezza uses zone maps to help narrow scan scope). Infobright is open source, with both a community edition and a supported enterprise edition.
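For readers unfamiliar with zone maps, the general idea (as I understand it, not Netezza's or Infobright's actual implementation) is to keep the minimum and maximum value of a column for each storage block, so a range scan can skip blocks that cannot contain a match. A minimal Python sketch, where the `Block` class, the block size, and the data are all assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass
class Block:
    rows: list  # column values stored in this block
    lo: int     # smallest value in the block (the "zone map" entry)
    hi: int     # largest value in the block


def build_blocks(values, block_size):
    """Split values into fixed-size blocks and record min/max per block."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks, lo, hi):
    """Return values in [lo, hi] and how many blocks had to be read."""
    matches, blocks_read = [], 0
    for block in blocks:
        # Skip the block entirely if its range cannot overlap the query range.
        if block.hi < lo or block.lo > hi:
            continue
        blocks_read += 1
        matches.extend(v for v in block.rows if lo <= v <= hi)
    return matches, blocks_read


values = list(range(100))            # invented, already sorted data
blocks = build_blocks(values, 10)    # 10 blocks of 10 values each
matches, blocks_read = scan_range(blocks, 42, 47)
print(matches, blocks_read)          # [42, 43, 44, 45, 46, 47] 1
```

The pruning only pays off when the data is at least roughly clustered on the filtered column; with randomly ordered values most blocks would span the whole range and nothing could be skipped.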

There is much more that can be said in context of your particular problem - probably beyond the scope of this forum. Hope this helps.

栀梦 2024-08-25 05:58:26

You haven't specified what questions you are trying to answer with your queries, or how your data is structured. Before you choose what solution to use you probably need to think about those two things.

You're correct: the major RDBMS vendors offer clustering solutions, both for parallel processing and for high availability. They've had this technology for a while, and any enterprise with a lot of data is probably using it. When you buy the product ($$$), they will give you lots of documentation and, if you can afford it, help you set it up (more $$$).

RDBMSs are good for online transaction processing (OLTP): answering questions about specific rows (where does Mary live?) and answering some summary-type questions (how much did we sell in the first quarter?). Although they can be made to answer detailed summary questions (how much did we sell in the first quarter, broken down by product, salesperson, month, and region?), at that point you're usually starting to tax their limits, because any query that needs to visit all of the rows is going to be slow.
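As a rough illustration of the two kinds of question, here is a small Python sketch using the standard-library `sqlite3` module. The `sales` table, its columns, and the rows are invented for the example; the point is only the difference in shape between a keyed row lookup and a grouped aggregate that has to touch every qualifying row.

```python
import sqlite3

# In-memory database with a hypothetical sales table (invented schema and data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, product TEXT, salesperson TEXT, "
    "month INTEGER, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (product, salesperson, month, region, amount) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("widget", "mary", 1, "east", 100.0),
        ("widget", "mary", 1, "east", 50.0),
        ("gadget", "bob", 2, "west", 75.0),
        ("widget", "bob", 4, "east", 20.0),  # outside the first quarter
    ],
)


def lookup(sale_id):
    """OLTP-style question: fetch one specific row by its primary key."""
    return conn.execute(
        "SELECT product, amount FROM sales WHERE id = ?", (sale_id,)
    ).fetchone()


def q1_breakdown():
    """Detailed summary: first-quarter sales by product, salesperson, month, region."""
    return conn.execute(
        "SELECT product, salesperson, month, region, SUM(amount) "
        "FROM sales WHERE month BETWEEN 1 AND 3 "
        "GROUP BY product, salesperson, month, region "
        "ORDER BY product, salesperson, month, region"
    ).fetchall()


print(lookup(1))
print(q1_breakdown())
```

The lookup can be answered from the primary-key index, while the breakdown reads every first-quarter row, which is the workload that grows painful as the table gets large.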

For those types of queries most enterprises have a data warehouse that structures the data into multi-dimensional "cubes." (See Cognos, Hyperion, others). That may be appropriate for what you're trying to do.

I don't have any experience with MapReduce, but I've read the Uses section of the Wikipedia article, so if what you're trying to do falls into those categories, I'd continue with it.

哥,最终变帅啦 2024-08-25 05:58:26

If you are in a fast-paced, growing organization, you should use Teradata. We have had a really good experience with it. It gives you scalability that no other vendor can provide. Once you get used to its SQL and working style, you will really appreciate the design and architecture of Teradata.
