Want to understand whether Netezza or Hadoop is the right choice for the following purposes:
Pull feed files of considerable size (at times more than a GB) from several online sources.
Clean, filter, transform, and compute further information from the feeds.
Generate metrics on different dimensions, akin to how data warehouse cubes do it.
Help webapps access the final data/metrics faster using SQL or any other standard mechanism.
Comments (4)
How it works:
As the data is loaded into the Appliance, it intelligently separates each table across the 108 SPUs. Typically, the hard disk is the slowest part of a computer. Imagine 108 of these spinning up at once, each loading a small piece of the table. This is how Netezza achieves a 500-gigabyte-an-hour load time.
After a piece of the table is loaded and stored on each SPU (a computer on an integrated circuit card), each column is analyzed to gather descriptive statistics such as minimum and maximum values. These values are stored on each of the 108 SPUs instead of indexes, which take time to create and update and take up unnecessary space. Imagine your environment without the need to create indexes.
When it is time to query the data, a master computer inside the Appliance asks the SPUs which ones contain the required data. Only the SPUs that contain the appropriate data return information, so less information moves across the network to the Business Intelligence/Analytics server.
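The pruning idea above can be sketched in plain Python. This is a toy model, not Netezza's actual internals: each "SPU" keeps min/max statistics for its contiguous piece of a table, and a query only scans the SPUs whose range could contain the value.

```python
def load_into_spus(rows, num_spus=4):
    """Split rows into contiguous pieces across SPUs; record min/max per piece."""
    rows = list(rows)
    size = -(-len(rows) // num_spus)  # ceiling division: piece size per SPU
    spus = []
    for i in range(num_spus):
        piece = rows[i * size:(i + 1) * size]
        spus.append({"rows": piece,
                     "min": min(piece) if piece else None,
                     "max": max(piece) if piece else None})
    return spus

def query(spus, value):
    """Only SPUs whose [min, max] range covers the value scan their rows."""
    hits, scanned = [], 0
    for spu in spus:
        if spu["min"] is not None and spu["min"] <= value <= spu["max"]:
            scanned += 1  # this SPU might hold the value, so it does work
            hits.extend(r for r in spu["rows"] if r == value)
    return hits, scanned
```

Loading `range(100)` across 4 SPUs and querying for 42 touches only one of the four pieces; the other three are skipped purely on their stored statistics, with no index involved.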
For joining data, it gets even better. The Appliance distributes data in multiple tables across multiple SPUs by a key. Each SPU contains partial data for multiple tables. It joins the parts of each table locally on each SPU, returning only the local result. All of the 'local results' are assembled internally in the cabinet and then returned to the Business Intelligence/Analytics server as a query result. This methodology also contributes to the speed story.
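The co-located join can be sketched the same way. This is an illustrative model with made-up table names, not a real appliance API: both tables are distributed by hashing the join key, so matching rows always land on the same "SPU" and each SPU can join its pieces locally.

```python
def distribute(table, key_index, num_spus=4):
    """Hash-distribute rows across SPUs by the value in the key column."""
    spus = [[] for _ in range(num_spus)]
    for row in table:
        spus[hash(row[key_index]) % num_spus].append(row)
    return spus

def colocated_join(orders, customers, num_spus=4):
    """Join orders to customers on customer id, entirely via local joins."""
    o_spus = distribute(orders, 0, num_spus)     # both tables distributed
    c_spus = distribute(customers, 0, num_spus)  # on the same join key
    result = []
    for o_part, c_part in zip(o_spus, c_spus):   # each SPU joins locally
        names = {cust_id: name for cust_id, name in c_part}
        result.extend((o[0], o[1], names[o[0]]) for o in o_part if o[0] in names)
    return sorted(result)  # 'assemble the local results' into one answer
```

Because the hash of a given key is the same for both tables, no rows ever need to move between SPUs to find their join partners; only the joined result leaves each piece.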
The key to all of this is 'less movement of data across the network'. The Appliance returns only the data required back to the Business Intelligence/Analytics server across the organization's 1000/100 MB network. This is very different from traditional processing, where the Business Intelligence/Analytics software typically extracts most of the data from the database and does its processing on its own server. Here the database does the work to determine the data needed, returning a smaller subset result to the Business Intelligence/Analytics server.
Backup And Redundancy
To understand how the data and system are set up for almost 100% uptime, it is important to understand the internal design. It uses the outer, fastest one-third of each 400-gigabyte disk for data storage and retrieval. One third of the disk stores descriptive statistics, and the other third stores hot data backups of other SPUs. Each Appliance cabinet also contains 4 additional SPUs for automatic failover of any of the 108 SPUs.
Taken from http://www2.sas.com
I would consider separating the design of the batch ETL process from the further SQL requests. I think the following numbers are important for evaluating the decision:
a) How much raw data do you want to process daily?
b) How much raw data do you want to store in the system?
c) What will be the size of the RDBMS dataset?
d) What kind of SQL are you going to run? I mean: are there ad-hoc SQLs, or well-planned reports? Another question: do you need joins between two large tables?
With the above questions answered it will be possible to give a better answer.
For example, I would consider Netezza as an option when you really need joins of very large tables, and Hadoop if you need to store terabytes of data.
It would seem from your answers that Netezza may be more suited to your needs. It handles ad-hoc queries very well, and the newest version of its software has built-in support for rollups and cubes. Also, Netezza operates on the scale of terabytes of data, so you should be more than able to process the data you have available.
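As a rough illustration of what such rollup support computes, here is a small Python sketch of SQL's `GROUP BY ROLLUP(region, product)` over made-up sales rows (the column names are hypothetical):

```python
def rollup(rows):
    """Sum amounts at (region, product), (region,), and grand-total levels,
    mimicking GROUP BY ROLLUP(region, product); None marks a rolled-up level."""
    totals = {}
    for region, product, amount in rows:
        for key in [(region, product), (region, None), (None, None)]:
            totals[key] = totals.get(key, 0) + amount
    return totals
```

For rows `[("east", "pen", 10), ("east", "book", 5), ("west", "pen", 7)]` this yields per-product subtotals, per-region subtotals such as `("east", None): 15`, and the grand total `(None, None): 22`; a warehouse with built-in rollup support produces these aggregation levels in one pass rather than via separate queries.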
If you are dealing with an ELT scenario where you have to load huge volumes of files and process them later (filter, transform, and load them into traditional databases for analytics), then you can use Hadoop to load the files and then Netezza as the target staging or data warehouse area. With Hadoop you can put all your files into HDFS and then read them with an ETL tool to transform, filter, etc., or use Hive SQL to query the data in those files. However, Hive, the Hadoop-based data warehouse, does not support updates and does not support all SQL statements. Hence, it is better to read those files from HDFS, apply filters and transformations, and load the result into a traditional data warehouse appliance such as Netezza to write your queries for cubes.
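The filter-and-transform step in that flow can be sketched in a few lines of Python. The pipe-delimited feed format and the derived column are made up for illustration; the point is producing clean, load-ready rows for the warehouse:

```python
def transform_feed(raw_lines):
    """Drop malformed feed rows, normalize case, and compute a derived total."""
    out = []
    for line in raw_lines:
        parts = line.strip().split("|")
        if len(parts) != 3:
            continue                      # filter: skip structurally bad rows
        name, qty, price = parts
        try:
            qty, price = int(qty), float(price)
        except ValueError:
            continue                      # filter: skip rows with bad numbers
        out.append((name.lower(), qty, price, qty * price))  # derived column
    return out
```

Running this over `["Pen|3|1.50", "bad row", "Book|2|x"]` keeps only the one well-formed row, as `("pen", 3, 1.5, 4.5)`; the cleaned output is what you would hand to the warehouse loader rather than the raw landed files.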
If you are loading gigabytes of data daily into Netezza with landing, staging, and mart areas, then most likely you will end up using a lot of space. In this scenario you can put your landing area on Hadoop and keep your staging and mart areas on Netezza. If your queries are simple and you are not doing very complex filtering or updates to the source, you may be able to manage everything with Hadoop.
To conclude, Hadoop is ideal for huge volumes of data but does not support all the functionality of a traditional data warehouse.
You can check out this link to see the differences:
http://dwbitechguru.blogspot.ca/2014/12/how-to-select-between-hadoop-vs-netezza.html