What is the difference between Pig and Hive? Why have both?

Published 2024-09-11 19:48:13


冷血 2024-09-18 19:48:13

Check out this post from Alan Gates, Pig architect at Yahoo!, that compares when one would use an SQL-like language such as Hive rather than Pig. He makes a very convincing case for the usefulness of a procedural language like Pig (vs. declarative SQL) and its utility to dataflow designers.

秋意浓 2024-09-18 19:48:13

Hive was designed to appeal to a community comfortable with SQL. Its philosophy was that we don't need yet another scripting language. Hive supports map and reduce transform scripts in the language of the user's choice (which can be embedded within SQL clauses). It is widely used at Facebook by analysts comfortable with SQL as well as by data miners programming in Python. SQL compatibility efforts in Pig have been abandoned AFAIK - so the difference between the two projects is very clear.

Supporting SQL syntax also means that it's possible to integrate with existing BI tools like Microstrategy. Hive has an ODBC/JDBC driver (that's a work in progress) that should allow this to happen in the near future. It's also beginning to add support for indexes which should allow support for drill-down queries common in such environments.

Finally--this is not pertinent to the question directly--Hive is a framework for performing analytic queries. While its dominant use is to query flat files, there's no reason why it cannot query other stores. Currently Hive can be used to query data stored in HBase (which is a key-value store like those found in the guts of most RDBMSes), and the HadoopDB project has used Hive to query a federated RDBMS tier.
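The transform scripts mentioned above are plain programs that read rows from stdin and write rows to stdout; Hive's `TRANSFORM` clause streams query rows through them. Below is a minimal sketch of such a script. The field names and logic are illustrative, not from the original post.

```python
#!/usr/bin/env python3
# Sketch of a script that Hive's TRANSFORM clause could stream rows through.
# Rows arrive on stdin as tab-separated fields; each printed line becomes
# a row of the query result. Field layout here is an assumption.
import sys

def transform(line):
    """Uppercase the first field and count the words in the second."""
    fields = line.rstrip("\n").split("\t")
    user, comment = fields[0], fields[1]
    return "%s\t%d" % (user.upper(), len(comment.split()))

def main(stream=sys.stdin):
    # Inside Hive, this loop would process the streamed query rows.
    for line in stream:
        print(transform(line))
```

In HiveQL this might be invoked along the lines of `SELECT TRANSFORM (user, comment) USING 'python transform.py' AS (user, word_count) FROM comments;` (hypothetical table and column names).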

〃安静 2024-09-18 19:48:13

I found this the most helpful (though, it's a year old) - http://yahoohadoop.tumblr.com/post/98256601751/pig-and-hive-at-yahoo

It specifically talks about Pig vs Hive and when and where they are employed at Yahoo. I found this very insightful. Some interesting notes:

On incremental changes/updates to data sets:

Instead, joining against the new incremental data and using the
results together with the results from the previous full join is the
correct approach. This will take only a few minutes. Standard database
operations can be implemented in this incremental way in Pig Latin,
making Pig a good tool for this use case.
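The incremental idea in the quote above can be sketched in plain Python: instead of re-joining the full data set, join only the newly arrived delta and append the result to the previous full join's output. The data shapes below are illustrative assumptions.

```python
# Sketch of the incremental join described above: join only the new delta
# against the other input and union it with yesterday's full join result.

def join(left, right):
    """Inner join two lists of (key, value) pairs on their key."""
    index = {}
    for k, v in right:
        index.setdefault(k, []).append(v)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]

def incremental_join(previous_result, delta, right):
    """Previous full-join output plus the join of just the delta."""
    return previous_result + join(delta, right)

users = [("u1", "alice"), ("u2", "bob")]
yesterday = join([("u1", "click")], users)              # full join so far
today = incremental_join(yesterday, [("u2", "view")], users)
```

In Pig Latin the same shape would be a `JOIN` of the delta relation followed by a `UNION` with the stored result, which is why only the delta's size drives the运行 time. (Only the new data is scanned each run.)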

On using other tools via streaming:

Pig integration with streaming also makes it easy for researchers to
take a Perl or Python script they have already debugged on a small
data set and run it against a huge data set.

On using Hive for data warehousing:

In both cases, the relational model and SQL are the best fit. Indeed,
data warehousing has been one of the core use cases for SQL through
much of its history. It has the right constructs to support the types
of queries and tools that analysts want to use. And it is already in
use by both the tools and users in the field.

The Hadoop subproject Hive provides a SQL interface and relational
model for Hadoop. The Hive team has begun work to integrate with BI
tools via interfaces such as ODBC.

彡翼 2024-09-18 19:48:13

Hive is better than PIG in: Partitions, Server, Web interface & JDBC/ODBC support.

Some differences:

  1. Hive is best for structured Data & PIG is best for semi structured data

  2. Hive is used as a declarative SQL & PIG as a procedural language

  3. Hive supports partitions & PIG does not

  4. Hive defines tables with (schema) and stores schema information in a database & PIG doesn't have a dedicated metadata of database

  5. Pig also supports an additional COGROUP feature for performing outer joins, but Hive does not. However, both Hive & PIG can join, order & sort dynamically.
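For point 5, a rough Python analogue shows what COGROUP does: it groups two relations by key side by side, keeping each side's tuples in separate bags, from which an outer join can be derived. The sample data is illustrative.

```python
# Rough Python analogue of Pig's COGROUP: group two relations by key,
# keeping each side's tuples in separate "bags". Keys present on only
# one side get an empty bag on the other, which is what outer joins need.
from collections import defaultdict

def cogroup(left, right):
    """Return {key: (left_bag, right_bag)} for two lists of (key, value)."""
    groups = defaultdict(lambda: ([], []))
    for k, v in left:
        groups[k][0].append(v)
    for k, v in right:
        groups[k][1].append(v)
    return dict(groups)

owners = [("alice", "cat"), ("bob", "dog")]
visits = [("alice", "vet"), ("carol", "park")]
g = cogroup(owners, visits)
# "carol" has an empty left bag - exactly the rows a full outer join keeps.
```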

水中月 2024-09-18 19:48:13

I believe that the real answer to your question is that they are/were independent projects and there was no centrally coordinated goal. They were in different spaces early on and have grown to overlap with time as both projects expand.

Paraphrased from the Hadoop O'Reilly book:

Pig: a dataflow language and
environment for exploring very large
datasets.

Hive: a distributed data warehouse

梦情居士 2024-09-18 19:48:13

You can achieve similar results with Pig/Hive queries. The main difference lies in the approach to understanding/writing/creating queries.

Pig tends to create a flow of data: small steps where in each you do some processing.
Hive gives you an SQL-like language to operate on your data, so transformation from an RDBMS is much easier (Pig can be easier for someone with no prior experience of SQL).

It is also worth noting that for Hive you can use a nice interface to work with this data (Beeswax for HUE, or the Hive web interface), and it also gives you a metastore with information about your data (schema, etc.), which is useful as central information about your data.

I use both Hive and Pig for different queries (I use whichever lets me write the query faster/easier; I do it this way mostly for ad-hoc queries) - they can use the same data as input. But currently I'm doing much of my work through Beeswax.

醉梦枕江山 2024-09-18 19:48:13

Pig allows one to load data and user code at any point in the pipeline. This can be particularly important if the data is streaming data, for example data from satellites or instruments.

Hive, which is RDBMS based, needs the data to be first imported (or loaded) and after that it can be worked upon. So if you were using Hive on streaming data, you would have to keep filling buckets (or files) and use hive on each filled bucket, while using other buckets to keep storing the newly arriving data.

Pig also uses lazy evaluation. It allows greater ease of programming, and one can use it to analyze data in different ways with more freedom than in an SQL-like language such as Hive. So if you really wanted to analyze matrices or patterns in some unstructured data you had, and wanted to do interesting calculations on them, with Pig you can go some fair distance, while with Hive you need something else to play with the results.

Pig is faster in the data import but slower in actual execution than an RDBMS friendly language like Hive.

Pig is well suited to parallelization and so it possibly has an edge for systems where the datasets are huge, i.e. in systems where you are concerned more about the throughput of your results than the latency (the time to get any particular datum of result).
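The lazy evaluation mentioned above means Pig only executes a script when output is requested (DUMP/STORE); until then it just builds a plan. Python generators give a feel for this style; the pipeline below is an illustrative analogy, not Pig's actual machinery.

```python
# Analogy for Pig's lazy evaluation: each stage below is a generator, so
# nothing touches the data until the final result is forced, just as Pig
# builds a logical plan and runs it only on DUMP/STORE.

def load(records):
    for r in records:
        yield r

def filter_by(rows, predicate):
    for r in rows:
        if predicate(r):
            yield r

def project(rows, field):
    for r in rows:
        yield r[field]

raw = [{"user": "alice", "clicks": 3}, {"user": "bob", "clicks": 0}]
plan = project(filter_by(load(raw), lambda r: r["clicks"] > 0), "user")
# No work has happened yet; forcing the plan runs the whole pipeline:
result = list(plan)
```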

耳根太软 2024-09-18 19:48:13

Hive Vs Pig-

Hive is an SQL interface which serves SQL-savvy users, or other tools like Tableau/MicroStrategy or any other tool or language that has an SQL interface.

PIG is more like an ETL pipeline, with step-by-step commands like declaring variables, looping, iterating, conditional statements, etc.

I prefer writing Pig scripts over Hive QL when I want to write complex step-by-step logic. When I can comfortably write a single SQL statement to pull the data I want, I use Hive. For Hive you need to define tables before querying (as you do in an RDBMS).

The purposes of the two are different, but under the hood both do the same thing: convert to MapReduce programs. Also, the Apache open-source community keeps adding more and more features to both projects.

若有似无的小暗淡 2024-09-18 19:48:13

Read the difference between PIG and HIVE in this link.

http://www.aptibook.com/Articles/Pig-and-hive-advantages-disadvantages-features

All the aspects are covered. If you are unsure which to choose, have a look at that web page.

傲性难收 2024-09-18 19:48:13
  1. Pig Latin is data-flow style and more suitable for software engineers, while SQL is more suitable for analysts who are used to SQL. For complex tasks, in Hive you have to manually create temporary tables to store intermediate data, but that is not necessary in Pig.

  2. Pig Latin is suitable for complicated data structures (like small graphs). There's a data structure in Pig called DataBag which is a collection of Tuples. Sometimes you need to calculate metrics which involve multiple tuples (there's a hidden link between the tuples; in this case I would call it a graph). In such cases, it is very easy to write a UDF to calculate the metrics involving multiple tuples. Of course it can be done in Hive, but it is not as convenient as in Pig.

  3. Writing a UDF in Pig is much easier than in Hive, in my opinion.

  4. Pig has no metadata support (or it is optional; in the future it may integrate HCatalog). Hive stores tables' metadata in a database.

  5. You can debug a Pig script in a local environment, but it would be hard for Hive to do that. The reason is point 4: you would need to set up Hive metadata in your local environment, which is very time-consuming.
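On points 3 and 5: Pig can register UDFs written in Python (via Jython), and because a UDF is just a function it can be unit-tested locally before ever touching a cluster. A sketch is below; the function name and schema string are illustrative, while the `pig_util.outputSchema` import follows Pig's Jython UDF convention.

```python
# Sketch of a Pig Python (Jython) UDF. The outputSchema decorator tells
# Pig the return type; when running outside Pig we fall back to a no-op
# decorator so the function can be tested locally.
try:
    from pig_util import outputSchema      # available when run inside Pig
except ImportError:                        # local testing fallback
    def outputSchema(schema):
        def wrap(fn):
            return fn
        return wrap

@outputSchema("domain:chararray")
def extract_domain(email):
    """Return the part of an e-mail address after the '@', or None."""
    if email is None or "@" not in email:
        return None
    return email.split("@", 1)[1]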

落花浅忆 2024-09-18 19:48:13

I found below useful link to explore how and when to use HIVE and PIG.

http://www.hadoopwizard.com/when-to-use-pig-latin-versus-hive-sql/

笑脸一如从前 2024-09-18 19:48:13

From the link:
http://www.aptibook.com/discuss-technical?uid=tech-hive4&question=What-kind-of-datawarehouse-application-is-suitable-for-Hive?

Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.

Hive is most suited for data warehouse applications, where

1) Relatively static data is analyzed,

2) Fast response times are not required, and

3) When the data is not changing rapidly.

Hive doesn’t provide crucial features required for OLTP, Online Transaction Processing. It’s closer to being an OLAP tool, Online Analytic Processing.
So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

失退 2024-09-18 19:48:13

In simpler words, Pig is a high-level platform for creating MapReduce programs used with Hadoop; using Pig scripts we process large amounts of data into the desired format.

Once the processed data is obtained, it is kept in HDFS for later processing to obtain the desired results.

On top of the stored processed data we apply Hive SQL commands to get the desired results; internally, these Hive SQL commands run MapReduce programs.

孤独患者 2024-09-18 19:48:13

To give a very high level overview of both, in short:

1) Pig is a relational algebra over Hadoop

2) Hive is SQL over Hadoop (one level above Pig)

滥情空心 2024-09-18 19:48:13

When we use Hadoop, it generally means we are trying to do huge data processing. The end goal of that processing is to generate content/reports out of it.

So it internally consists of 2 prime activities:

1) Loading / data processing

2) Generating content and using it for reporting, etc.

Loading / data processing -> Pig is helpful here.

It helps as an ETL tool (we can perform ETL operations using Pig scripts).

Once the result is processed, we can use Hive to generate reports based on the processed result.

Hive: it is built on top of HDFS for warehouse processing.

We can easily generate ad-hoc reports using Hive from the processed content generated by Pig.

梦晓ヶ微光ヅ倾城 2024-09-18 19:48:13

What can HIVE do that is not possible in PIG?

Partitioning can be done using HIVE but not in PIG; it is a way of bypassing the output.

What can PIG do that is not possible in HIVE?

Positional referencing - even when you don't have field names, you can reference fields by position, like $0 for the first field, $1 for the second, and so on.

Another fundamental difference is that PIG doesn't need a schema to write the values, but HIVE does need a schema.

You can connect to HIVE from any external application using JDBC and the like, but not to PIG.

Note: both run on top of HDFS (the Hadoop distributed file system), and the statements are converted to MapReduce programs.
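The positional referencing mentioned above has a close analogue in any language: with no schema, fields are addressed purely by position. The snippet below captures the idea; the sample record layout is illustrative.

```python
# Analogue of Pig's positional references ($0, $1, ...): a schema-less
# record is just a split line, and fields are addressed by index rather
# than by name.

line = "alice\t42\tNYC"
fields = line.split("\t")                  # no field names, only positions
name = fields[0]                           # like $0 in Pig Latin
age = int(fields[1])                       # like (int)$1 in Pig Latin
```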

絕版丫頭 2024-09-18 19:48:13

Pig eats anything! Meaning it can consume unstructured data.

Hive requires a schema.

又怨 2024-09-18 19:48:13

Generally speaking, Pig is useful for ETL kinds of workloads - for example, a set of transformations you need to apply to your data every day.

Hive shines when you need to run ad-hoc queries or just want to explore data. It can sometimes act as an interface to your visualisation layer (Tableau/QlikView).

Both are essential and serve different purposes.
