Design considerations for evaluating and comparing Hadoop for business intelligence

Posted 2024-11-16 05:03:08

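I am looking at various technologies for data warehousing and business intelligence, and came across this radical tool called Hadoop. Hadoop doesn't seem to be built exactly for BI purposes, but there are references suggesting it has potential in this field (http://www.infoworld.com/d/data-explosion/hadoop-pitched-business-intelligence-488).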

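Though the information I have found on the internet is scant, my gut feeling is that Hadoop can become a disruptive technology in the space of traditional BI solutions. There is very little information on this topic, so I would like to gather the gurus' thoughts here on Hadoop's potential as a BI tool compared with traditional back-end BI infrastructure such as Oracle Exadata, Vertica, etc. For starters, I would like to ask the following question:

Design considerations: How would designing a BI solution with Hadoop differ from designing one with traditional tools? I know it should be different, since I have read that one cannot create schemas in Hadoop. I have also read that a major advantage would be eliminating ETL tools altogether with Hadoop (is this true?). Do we need Hadoop + Pig + Mahout to get a BI solution?

Thanks & regards!

Edit: Breaking this down into multiple questions. Starting with the one I think is most important.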

兔小萌 2024-11-23 05:03:08

Hadoop is a great tool to be part of a BI solution. It is not, itself, a BI solution. What Hadoop does is take in Data_A and output Data_B. Whatever BI needs that is not yet in a useful form can be processed with MapReduce and output in a useful form, be it CSV, Hive, HBase, MSSQL, or anything else used to view data.

I believe Hadoop is supposed to be the ETL tool. That's what we are using it for. We process gigabytes of log files every hour and store them in Hive, then run daily aggregations that are loaded into an MSSQL server and viewed through a visualization layer.
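
To make the logs-to-daily-aggregates flow concrete, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. The log format (an ISO timestamp followed by a URL path, separated by whitespace) and all function names are assumptions made for illustration, not the setup described above; a real job would run on the cluster (for example through Hive or Hadoop Streaming) rather than in a single process.

```python
import csv
import io
from collections import defaultdict


def map_line(line):
    """Map step: emit ((day, path), 1) for one log line, or nothing if malformed."""
    parts = line.split()
    if len(parts) < 2:
        return []
    timestamp, path = parts[0], parts[1]
    day = timestamp[:10]  # assumes ISO timestamps such as 2024-01-01T10:15:00
    return [((day, path), 1)]


def reduce_counts(pairs):
    """Shuffle + reduce step: group by key and sum the values."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def daily_aggregate(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return reduce_counts(pairs)


def to_csv(totals):
    """Render the aggregates as CSV, a form a warehouse bulk loader can ingest."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["day", "path", "hits"])
    for (day, path), hits in sorted(totals.items()):
        writer.writerow([day, path, hits])
    return buf.getvalue()


# Hypothetical sample input; the last line is deliberately malformed.
sample_logs = [
    "2024-01-01T10:15:00 /home",
    "2024-01-01T11:20:00 /home",
    "2024-01-01T11:25:00 /pricing",
    "2024-01-02T09:00:00 /home",
    "malformed",
]
daily_totals = daily_aggregate(sample_logs)
print(to_csv(daily_totals))
```

The point of the sketch is the shape of the work: raw lines go in, a small pre-aggregated table comes out, and only that small table needs to be loaded into the database the BI layer queries.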

The major design considerations I've run up against are:
- Data flexibility: Do you want your users to view pre-aggregated data, or to have the flexibility to adjust the query and look at the data however they want?
- Speed: How long do you want your users to wait for the data? Hive (for example) is slow. It takes minutes to generate results, even on fairly small data sets. The more data a query traverses, the longer it takes to generate a result.
- Visualization: What type of visualization do you want to use? Do you want to custom-build a lot of pieces or use something off the shelf? What constraints and flexibility does your visualization need? How changeable does it need to be?

hth

Update: In response to @Bhat's comment asking about the lack of visualization...
The lack of a visualization tool that would let us effectively use the data stored in HBase was a major factor in re-evaluating our solution. We stored the raw data in Hive, then pre-aggregated it and stored the result in HBase. To use this we were going to have to write a custom connector (we did this part) and a visualization layer. We looked at what we would be able to produce and what is commercially available, and went the commercial route.
We still use Hadoop as our ETL tool for processing our weblogs; it's fantastic for that. We just send the ETL'd raw data to a commercial big data database that will take the place of both Hive and HBase in our design.

Hadoop doesn't really compare to MSSQL or other data warehouse storage. Hadoop doesn't do any storage (ignoring HDFS); it does processing of data. Running MapReduce jobs (which is what Hive does) is going to be slower than MSSQL (or similar).


药祭#氼 2024-11-23 05:03:08

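Hadoop is very well suited to storing huge files that can represent fact tables. These tables can be partitioned by placing the individual files that make up a table in separate directories. Hive understands this kind of file structure and lets you query the files as partitioned tables. You can express your BI questions against Hadoop data as SQL queries through Hive, but you will still need to write and run the occasional MapReduce job.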
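
As a rough illustration of the directory-per-partition idea, the sketch below builds Hive-style `key=value` partition directories and shows how a query restricted to one partition can skip whole directories without opening a file. The table root `/warehouse/sales_fact`, the partition column `dt`, and the helper names are hypothetical; the code only manipulates path strings and does not talk to HDFS or Hive.

```python
from pathlib import PurePosixPath


def partition_dir(table_root, **partition_values):
    """Build a Hive-style partition directory such as <root>/dt=2024-01-01."""
    path = PurePosixPath(table_root)
    for key, value in partition_values.items():
        path = path / f"{key}={value}"
    return str(path)


def parse_partition(file_path):
    """Recover partition columns from the key=value segments of a path."""
    values = {}
    for segment in PurePosixPath(file_path).parts:
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


def prune(files, **wanted):
    """Keep only files whose partition values match, without reading any file."""
    return [
        f for f in files
        if all(parse_partition(f).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical fact table split into one directory per day.
files = [
    partition_dir("/warehouse/sales_fact", dt="2024-01-01") + "/part-00000",
    partition_dir("/warehouse/sales_fact", dt="2024-01-01") + "/part-00001",
    partition_dir("/warehouse/sales_fact", dt="2024-01-02") + "/part-00000",
]
selected = prune(files, dt="2024-01-02")
print(selected)
```

This is the reason partitioning matters for BI-style queries: a filter on the partition column narrows the set of files to scan before any data is read.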
