How to get Pig to work with lzo files?

Published 2024-12-02 11:03:48

So, I've seen a couple of tutorials for this online, but each seems to say something different. Also, none of them specifies whether you're trying to get things working directly on a remote cluster, to interact with a remote cluster locally, etc.

That said, my goal is just to get my local computer (a Mac) to make Pig work with lzo-compressed files that exist on a Hadoop cluster that's already been set up to work with lzo files. I already have Hadoop installed locally and can get files from the cluster with hadoop fs -[command].

I also already have Pig installed locally and communicating with the Hadoop cluster when I run scripts or just run things through grunt. I can load and play around with non-lzo files just fine. My problem is only figuring out a way to load lzo files. Maybe I can just process them through the cluster's instance of ElephantBird? I have no idea, and have found only minimal information online.

So, any sort of short tutorial or answer for this would be awesome, and would hopefully help more people than just me.

1 Answer

滥情稳全场 2024-12-09 11:03:48

I recently got this to work and wrote up a wiki on it for my coworkers. Here's an excerpt detailing how to get PIG to work with lzos. Hope this helps someone!

NOTE: This is written with a Mac in mind. The steps will be almost identical for other OSes, and this should definitely give you what you need to know to configure things on Windows or Linux, but you will need to extrapolate a bit (obviously, change the Mac-centric folders to match whatever OS you're using, etc.).

Hooking Pig up to work with LZOs

This was by far the most annoying and time-consuming part for me-- not because it's difficult, but because there are 50 different tutorials online, none of which are all that helpful. Anyway, what I did to get this working is:

  1. Clone hadoop-lzo from github at https://github.com/kevinweil/hadoop-lzo.

  2. Compile it to get a hadoop-lzo*.jar and the native *.o libraries. You'll need to compile this on a 64-bit machine.
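
A minimal sketch of these first two steps, assuming git, ant, a JDK, and the lzo headers (e.g. via brew install lzo) are already installed. The compile-native and tar targets are from the hadoop-lzo README, so double-check them against your checkout's build.xml:

```shell
# Clone and build hadoop-lzo (run this on a 64-bit machine).
git clone https://github.com/kevinweil/hadoop-lzo.git
cd hadoop-lzo
# Builds the hadoop-lzo*.jar and the native libraries under build/.
ant compile-native tar
```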

  3. Copy the native libs to $HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/.

  4. Copy the java jar to $HADOOP_HOME/lib and $PIG_HOME/lib.
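
Steps 3 and 4 are plain copies. The exact paths under build/ may differ depending on your ant version, so treat this as a sketch, run from inside the hadoop-lzo checkout:

```shell
# Step 3: copy the native libraries into hadoop's native lib dir.
mkdir -p "$HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/"
cp build/native/Mac_OS_X-x86_64-64/lib/* "$HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/"
# Step 4: copy the jar where both hadoop and pig will pick it up.
cp build/hadoop-lzo-*.jar "$HADOOP_HOME/lib/"
cp build/hadoop-lzo-*.jar "$PIG_HOME/lib/"
```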

  5. Then configure hadoop and pig to have the property java.library.path
    point to the lzo native libraries. You can do this in $HADOOP_HOME/conf/mapred-site.xml with:

    <property>
        <name>mapred.child.env</name>
        <value>JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/</value>
    </property>
    
  6. Now try out grunt shell by running pig again, and make sure everything still works. If it doesn't, you probably messed up something in mapred-site.xml and you should double check it.

  7. Great! We're almost there. All you need to do now is install elephant-bird. You can get that from https://github.com/kevinweil/elephant-bird (clone it).

  8. Now, in order to get elephant-bird to work, you'll need quite a few pre-reqs. These are listed on the page mentioned above, and might change, so I won't specify them here. What I will mention is that the versions on these are very important. If you get an incorrect version and try running ant, you will get errors. So, don't try grabbing the pre-reqs from brew or macports as you'll likely get a newer version. Instead, just download tarballs and build for each.

  9. Run ant in the elephant-bird folder to create the jar.
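
In other words (the build output location is an assumption; check elephant-bird's README for your version):

```shell
# Clone and build elephant-bird once the pre-reqs from step 8 are in place.
git clone https://github.com/kevinweil/elephant-bird.git
cd elephant-bird
ant   # should produce an elephant-bird-*.jar under build/
```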

  10. For simplicity's sake, move all relevant jars (hadoop-lzo-x.x.x.jar and elephant-bird-x.x.x.jar) that you'll need to register frequently somewhere you can easily find them. /usr/local/lib/hadoop/... works nicely.

  11. Try things out! Play around with loading normal files and lzos in grunt shell. Register the relevant jars mentioned above, try loading a file, limiting output to a manageable number, and dumping it. This should all work fine whether you're using a normal text file or an lzo.
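
A first session might look like the following. The jar version numbers and the HDFS path are placeholders, and LzoTextLoader is elephant-bird's loader for lzo-compressed plain text; verify the class name against the elephant-bird version you built:

```shell
# Placeholder jar versions and HDFS path -- adjust to your setup.
pig <<'EOF'
REGISTER /usr/local/lib/hadoop/hadoop-lzo-0.4.15.jar;
REGISTER /usr/local/lib/hadoop/elephant-bird-2.2.3.jar;

-- LzoTextLoader reads lzo-compressed text files line by line.
logs = LOAD '/path/on/hdfs/data.lzo'
       USING com.twitter.elephantbird.pig.load.LzoTextLoader()
       AS (line:chararray);

-- Keep the output manageable, then dump it.
few = LIMIT logs 10;
DUMP few;
EOF
```

If the DUMP succeeds on an .lzo file, the whole chain (native libs, hadoop-lzo jar, elephant-bird jar) is wired up correctly.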
