Evaluating and comparing business intelligence design considerations for Hadoop

Posted on 2024-11-16 05:03:08

I am considering various technologies for data warehousing and business intelligence, and have come across a radical tool called Hadoop. Hadoop doesn't seem to be built specifically for BI purposes, but there are references to it having potential in this field (http://www.infoworld.com/d/data-explosion/hadoop-pitched-business-intelligence-488).

Although I have found little information on the internet, my gut tells me that Hadoop could become a disruptive technology in the traditional BI space. Information on this topic is sparse, so I wanted to gather the gurus' thoughts here on the potential of Hadoop as a BI tool compared to traditional back-end BI infrastructure such as Oracle Exadata, Vertica, etc. For starters, I would like to ask the following question:

  • Design considerations: How would designing a BI solution with Hadoop differ from using traditional tools? I know it should be different, as I read that one cannot create schemas in Hadoop. I also read that a major advantage would be the complete elimination of ETL tools with Hadoop (is this true?). Do we need Hadoop + Pig + Mahout to get a BI solution?

Thanks & Regards!

Edit: Breaking this down into multiple questions. I will start with the one I think is most important.

Comments (4)

兔小萌 2024-11-23 05:03:08

Hadoop is a great tool to be part of a BI solution. It is not, in itself, a BI solution. What Hadoop does is take in Data_A and output Data_B. Whatever is needed for BI but is not in a useful form can be processed with MapReduce to output a useful form of the data, be it CSV, Hive, HBase, MSSQL, or anything else used to view data.

I believe Hadoop is supposed to be the ETL tool. That's what we are using it for. We process gigabytes of log files every hour and store them in Hive, then run daily aggregations that are loaded into an MSSQL server and viewed through a visualization layer.
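To make the "Hadoop as ETL" idea concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps that such a job performs. It is not the pipeline described above: the log format, the sample lines, and every function name are assumptions made up for illustration, and a real job would run distributed on the cluster rather than in one Python process.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical raw web-log lines: "<ISO timestamp> <path> <status>".
RAW_LOGS = [
    "2024-11-22T10:15:00 /home 200",
    "2024-11-22T10:47:12 /home 200",
    "2024-11-22T11:02:45 /cart 500",
    "2024-11-23T09:30:10 /home 200",
    "not a valid line",
]


def map_line(line):
    """Map step: parse one raw line and emit ((day, path), 1); skip bad lines."""
    parts = line.split()
    if len(parts) != 3:
        return []
    timestamp, path, _status = parts
    day = timestamp.split("T")[0]
    return [((day, path), 1)]


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one (day, path) key."""
    return key, sum(values)


def run_job(lines):
    """Run map, then a sort-based shuffle, then reduce, all in memory."""
    mapped = [pair for line in lines for pair in map_line(line)]
    mapped.sort(key=itemgetter(0))  # shuffle: bring equal keys together
    return [
        reduce_counts(key, (count for _, count in group))
        for key, group in groupby(mapped, key=itemgetter(0))
    ]


def to_rows(results):
    """Flatten results into (day, path, hits) rows, ready for a bulk load."""
    return [(day, path, hits) for (day, path), hits in results]


if __name__ == "__main__":
    for row in to_rows(run_job(RAW_LOGS)):
        print(row)
```

The rows returned by `to_rows` are the kind of daily aggregate that would then be loaded into the SQL server behind the visualization layer; the loading step itself is left out because it depends on the target database.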

The major design considerations I've run against are:
- Data flexibility: Do you want your users to view pre-aggregated data, or to have the flexibility to adjust the query and look at the data however they want?
- Speed: How long do you want your users to wait for the data? Hive (for example) is slow. It takes minutes to generate results, even on fairly small data sets. The larger the data traversed, the longer it will take to generate a result.
- Visualization: What type of visualization do you want to use? Do you want to custom-build a lot of pieces or be able to use something off the shelf? What constraints and flexibility are needed for your visualization? How flexible and changeable does the visualization need to be?

hth

Update: In response to @Bhat's comment asking about the lack of visualization...
The lack of a visualization tool that would allow us to effectively use the data stored in HBase was a major factor in re-evaluating our solution. We stored the raw data in Hive, then pre-aggregated the data and stored it in HBase. To use this we were going to have to write a custom connector (we did this part) and a visualization layer. We looked at what we would be able to produce and what is commercially available, and went the commercial route.
We still use Hadoop as our ETL tool for processing our weblogs; it's fantastic for that. We just send the ETL'd raw data to a commercial big data database that will take the place of both Hive and HBase in our design.

Hadoop doesn't really compare to MSSQL or other data warehouse storage. Hadoop doesn't do any storage (ignoring HDFS); it processes data. Running MapReduce jobs (which is what Hive does) is going to be slower than querying MSSQL (or similar).

药祭#氼 2024-11-23 05:03:08

Hadoop is very well suited for storing colossal files that can represent fact tables. These tables can be partitioned by placing the individual files that make up the table into separate directories. Hive understands such file structures and lets you query them like partitioned tables. You can phrase your BI questions against the Hadoop data as SQL queries via Hive, but you will still need to write and run an occasional MapReduce job.
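As a rough illustration of the directory-per-partition idea, the sketch below reads partition values out of file paths and skips files that cannot match a filter. It assumes Hive's `key=value` directory naming; the warehouse paths, the `page_views` table name, the `dt` partition key, and both helper functions are hypothetical, and this is not how Hive itself is implemented.

```python
from pathlib import PurePosixPath

# Hypothetical files of one fact table, laid out as <table>/<key>=<value>/<file>.
FILES = [
    "/warehouse/page_views/dt=2024-11-22/part-00000",
    "/warehouse/page_views/dt=2024-11-22/part-00001",
    "/warehouse/page_views/dt=2024-11-23/part-00000",
]


def partition_values(path):
    """Extract {key: value} from the key=value directory names in a path."""
    values = {}
    for part in PurePosixPath(path).parts:
        if "=" in part:
            key, _, value = part.partition("=")
            values[key] = value
    return values


def prune(files, **wanted):
    """Keep only files whose partition values match every requested key."""
    return [
        f for f in files
        if all(partition_values(f).get(k) == v for k, v in wanted.items())
    ]


if __name__ == "__main__":
    # Only the files under dt=2024-11-23 need to be read for this filter.
    for path in prune(FILES, dt="2024-11-23"):
        print(path)
```

The point of the layout is that a filter on the partition key can be answered by listing directories, so files in the other partitions are never opened.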

烛影斜 2024-11-23 05:03:08

From a business perspective, you should consider Hadoop if you have a lot of low-value data. There are many cases where RDBMS / MPP solutions are not cost-effective.
You should also consider Hadoop as a serious option if your data is not structured (HTML pages, for example).

痴意少年 2024-11-23 05:03:08

We are creating a comparison matrix for BI tools for Big Data / Hadoop
http://hadoopilluminated.com/hadoop_book/BI_Tools_For_Hadoop.html

It is a work in progress, and we would love any input.

(Disclaimer: I am the author of this online book.)
