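How do I load every file in a folder with Pig?
I have a folder of files created daily that all store the same type of information. I'd like to write a script that loads the newest 10 of them, UNIONs them, and then runs some other code on the result. Since Pig already has an ls command, I was wondering whether there is a simple way to get the 10 most recently created files and load them all under generic names using the same loader and options. I'm guessing it would look something like: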
REGISTER /usr/local/lib/hadoop/hadoop-lzo-0.4.13.jar;
REGISTER /usr/local/lib/hadoop/elephant-bird-2.0.5.jar;
FOREACH file in some_path:
file = LOAD 'file'
USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\\t')
AS (i1, i2, i3);

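I like both of the approaches above. I just wanted to offer one more option for Oozie enthusiasts. A Java action in Oozie writes a file to the location configured by "oozie.action.output.properties", and the Pig action picks that up and passes it to the Pig script. This is definitely not an elegant solution compared to the other two. I had trouble configuring embedded Pig with a Java action in Oozie, so I had to go with this solution.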
<workflow-app xmlns='uri:oozie:workflow:0.1' name='java-wf'>
    <start to='java1' />
    <!-- Java action: its output file (see oozie.action.output.properties) is captured by Oozie -->
    <action name='java1'>
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <main-class>org.apache.oozie.test.MyTest</main-class>
            <arg>${outputFileName}</arg>
            <capture-output/>
        </java>
        <ok to="pig1" />
        <error to="fail" />
    </action>
    <!-- Pig action: receives the captured PASS_ME value as the MY_VAR parameter -->
    <action name='pig1'>
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <script>script.pig</script>
            <param>MY_VAR=${wf:actionData('java1')['PASS_ME']}</param>
        </pig>
        <ok to="end" />
        <error to="fail" />
    </action>
    <kill name="fail">
        <message>Pig failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end' />
</workflow-app>
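This is not something I've been able to do out of the box, but it can be done outside the Pig script with some sort of wrapper or helper script (bash, perl, etc.). Suppose you write a script called last10.sh that prints your 10 most recent files as a comma-separated list.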
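The answer's own last10.sh is not shown here, so the following is only a minimal sketch of what such a helper might contain. It assumes the usual eight-column `hadoop fs -ls` output (permissions, replication, owner, group, size, date, time, path), that paths contain no spaces, and that "newest" means latest modification time, since that is the timestamp the listing reports. The function name `newest_paths` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical body for last10.sh (a sketch, not the original answer's script).
# newest_paths N: reads `hadoop fs -ls`-style lines on stdin and prints the
# N most recently modified file paths as a single comma-separated line.
newest_paths() {
  local count="${1:-10}"
  # Keep file rows only (skips the "Found N items" header and directories),
  # sort by date then time, keep the newest N, and join the paths with commas.
  awk 'NF >= 8 && $1 !~ /^d/ { print $6, $7, $8 }' \
    | LC_ALL=C sort -k1,1 -k2,2 \
    | tail -n "$count" \
    | awk '{ print $3 }' \
    | paste -sd, -
}

# Intended use inside last10.sh (some_path is the placeholder from the question):
#   hadoop fs -ls some_path | newest_paths 10
```

Sorting on the date and time columns rather than relying on listing order keeps the result independent of how the file system happens to order its output.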
You could then hand its output to Pig when you launch the script, for example as a parameter.
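The exact command from the answer is not preserved, so this is a hedged sketch of one way to launch Pig with the list. It relies on Pig's `-param name=value` command-line option; the parameter name `files`, the script name `my_script.pig`, and the `PIG_BIN` override (handy for a dry run) are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# run_pig_on FILES SCRIPT: launches Pig with the comma-separated FILES list
# exposed to the script as the parameter "files" (a hypothetical name).
# PIG_BIN is an illustrative override; it defaults to the `pig` executable.
run_pig_on() {
  local files="$1" script="$2"
  "${PIG_BIN:-pig}" -param "files=${files}" "$script"
}

# Intended use, with last10.sh on the PATH or in the current directory:
#   run_pig_on "$(./last10.sh)" my_script.pig
```

Quoting the `files=...` argument keeps the whole list in a single shell word before Pig sees it.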
Then, in your Pig script, use that list as the path in the LOAD statement.
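Pig loads the separate files when they are comma separated like this, which is equivalent to naming each file in a single LOAD statement.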
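The Pig side of the answer is also missing from this copy. As a rough sketch, the snippet below writes a Pig script that reuses the loader and schema from the question and reads its input from a `$files` parameter. The file name `load_latest.pig`, the relation name `data`, and the example paths in the comment are hypothetical.

```shell
#!/usr/bin/env bash
# Write a Pig script whose LOAD path comes from the "files" parameter.
# The quoted 'EOF' keeps $files and \\t literal so Pig, not the shell,
# performs the substitution.
cat > load_latest.pig <<'EOF'
REGISTER /usr/local/lib/hadoop/hadoop-lzo-0.4.13.jar;
REGISTER /usr/local/lib/hadoop/elephant-bird-2.0.5.jar;

-- With files=/some_path/file1,/some_path/file2 (hypothetical paths) this
-- reads the same as: LOAD '/some_path/file1,/some_path/file2' ...
data = LOAD '$files'
    USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\\t')
    AS (i1, i2, i3);
EOF
```

Because all the files land in one relation, no explicit UNION is needed before running the rest of the processing on `data`.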