Automated testing in Apache Hive
I am about to embark on a project using Apache Hadoop/Hive which will involve a collection of Hive query scripts that produce data feeds for various downstream applications. These scripts seem like ideal candidates for some unit testing: they represent the fulfillment of an API contract between my data store and client applications, so it's trivial to write down the expected results for a given set of starting data. My issue is how to run these tests.
If I were working with SQL queries, I could use something like SQLite or Derby to quickly bring up test databases, load test data, and run a collection of query tests against them. Unfortunately, I am unaware of any such tools for Hive. At the moment, my best thought is to have the test framework bring up a local Hadoop instance and run Hive against that, but I've never done that before and I'm not sure it will work or be the right path.
Also, I'm not interested in a pedantic discussion about whether what I am doing is unit testing or integration testing. I just need to be able to prove my code works.
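For readers unfamiliar with the SQLite-style workflow mentioned above, here is a minimal sketch of it in Python: bring up an in-memory database, load a small fixture, run the query under test, and compare against the expected rows. This is plain SQLite, not Hive, and HiveQL differs from SQLite's dialect; the table `page_views`, the query, and the function names are all hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical query under test; in a real project this would be
# read from one of the query script files.
FEED_QUERY = """
SELECT user_id, COUNT(*) AS views
FROM page_views
GROUP BY user_id
ORDER BY user_id
"""


def make_test_db():
    """Create an in-memory database and load a small, known fixture."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_views (user_id TEXT, url TEXT)")
    conn.executemany(
        "INSERT INTO page_views VALUES (?, ?)",
        [("alice", "/home"), ("alice", "/cart"), ("bob", "/home")],
    )
    return conn


def run_feed_query(conn):
    """Run the query under test and return all result rows."""
    return conn.execute(FEED_QUERY).fetchall()


def test_feed_query():
    """Compare the query output with the rows expected for the fixture."""
    conn = make_test_db()
    try:
        assert run_feed_query(conn) == [("alice", 2), ("bob", 1)]
    finally:
        conn.close()


if __name__ == "__main__":
    test_feed_query()
    print("feed query test passed")
```

The same three steps (fixture, query, expected rows) are what a Hive test setup would need to reproduce; only the engine that executes the query changes.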
Hive has a special standalone mode, specifically designed for testing purposes. In this mode it can run without Hadoop. I think it is exactly what you need.
Here is a link to the documentation:
http://wiki.apache.org/hadoop/Hive/HiveServer
I work on a team that supports a big data and analytics platform, and we have run into this kind of issue too.
We've been searching for a while and we found two pretty promising tools: https://github.com/klarna/HiveRunner https://github.com/bobfreitas/HadoopMiniCluster
HiveRunner is a framework built on top of JUnit for testing Hive queries. It starts a standalone HiveServer with an in-memory HSQL database as the metastore. With it you can stub tables, views, mock samples, etc.
It has some limitations regarding supported Hive versions, but I definitely recommend it.
Hope it helps you =)
You may also want to consider the following blog post, which describes automating unit testing using a custom utility class and Ant: http://dev.bizo.com/2011/04/hive-unit-testing.html
I know this is an old thread, but I'm posting in case someone comes across it. I have followed up on the whole minicluster and Hive testing topic and found that things have changed with MR2 and YARN, but in a good way. I have put together an article and a GitHub repo to help with it:
http://www.lopakalogic.com/articles/hadoop-articles/hive-testing/
Hope it helps!