Pig Latin: Load multiple files from a date range (part of the directory structure)

Posted on 2024-09-15 11:24:13

I have the following scenario:

Pig version used: 0.70

Sample HDFS directory structure:

/user/training/test/20100810/<data files>
/user/training/test/20100811/<data files>
/user/training/test/20100812/<data files>
/user/training/test/20100813/<data files>
/user/training/test/20100814/<data files>

As you can see in the paths listed above, one of the directory names is a date stamp.

Problem: I want to load files from a date range, say from 20100810 to 20100813.

I can pass the 'from' and 'to' of the date range as parameters to the Pig script, but how do I make use of these parameters in the LOAD statement? I am able to do the following:

temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader() AS (...);

The following works with hadoop:

hadoop fs -ls /user/training/test/{20100810..20100813}

But it fails when I try the same with LOAD inside the Pig script. How do I make use of the parameters passed to the Pig script to load data from a date range?

Error log follows:

Backend error message during job submission
-------------------------------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:858)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:875)
        at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:752)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:752)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:726)
        at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
        at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://<ServerName>.com/user/training/test/{20100810..20100813} matches 0 files
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:258)
        ... 14 more



Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias test
        at org.apache.pig.PigServer.openIterator(PigServer.java:521)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
        at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:169)

Do I need to make use of a higher-level language like Python to capture all date stamps in the range and pass them to LOAD as a comma-separated list?

cheers

Comments (11)

漫漫岁月 2024-09-22 11:24:13

As zjffdu said, the path expansion is done by the shell. One common way to solve your problem is simply to use Pig parameters (which is a good way to make your script more reusable anyway):

shell:

pig -f script.pig -param input=/user/training/test/{20100810..20100812}

script.pig:

temp = LOAD '$input' USING SomeLoader() AS (...);
月竹挽风 2024-09-22 11:24:13

Pig processes your file name pattern using Hadoop's file glob utilities, not the shell's. Hadoop's are documented here. As you can see, Hadoop does not support the '..' operator for a range. It seems to me you have two options: either write out the {date1,date2,...,dateN} list by hand, which is probably the way to go if this is a rare use case, or write a wrapper script that generates the list for you. Building such a list from a date range should be a trivial task in the scripting language of your choice. For my application, I went with the generated-list route, and it works fine (CDH3 distribution).
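
The wrapper-script option could be sketched roughly as follows in Python. This is only an illustration, not the script used in the answer above: the function name `date_glob` is hypothetical, and it assumes the directory names are plain `YYYYMMDD` stamps as in the question.

```python
from datetime import datetime, timedelta


def date_glob(start: str, end: str, fmt: str = "%Y%m%d") -> str:
    """Return '{d1,d2,...,dN}' covering every day from start to end, inclusive.

    Hypothetical helper: assumes directory names are date stamps in `fmt`.
    """
    first = datetime.strptime(start, fmt).date()
    last = datetime.strptime(end, fmt).date()
    if last < first:
        raise ValueError("end date is before start date")
    days = (last - first).days + 1
    stamps = [(first + timedelta(days=i)).strftime(fmt) for i in range(days)]
    # No spaces around the commas: the glob list must be one unbroken token.
    return "{" + ",".join(stamps) + "}"


if __name__ == "__main__":
    # Prints a path pattern that can be handed to the Pig script as a parameter.
    print("/user/training/test/" + date_glob("20100810", "20100813"))
```

Because the list is built from real calendar dates, a range that crosses a month boundary only contains days that exist.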

多情癖 2024-09-22 11:24:13

I ran across this answer when I was having trouble creating a file glob in a script and then passing it as a parameter into a Pig script.

None of the current answers applied to my situation, but I did find a general answer that might be helpful here.

In my case, the shell expansion was happening first and the result was then passed into the script, which understandably caused problems for the Pig parser.

Simply surrounding the glob in double quotes protects it from being expanded by the shell and passes it as-is into the command.

WON'T WORK:

$ pig -f my-pig-file.pig -p INPUTFILEMASK='/logs/file{01,02,06}.log' -p OTHERPARAM=6

WILL WORK:

$ pig -f my-pig-file.pig -p INPUTFILEMASK="/logs/file{01,02,06}.log" -p OTHERPARAM=6

I hope this saves someone some pain and agony.

一瞬间的火花 2024-09-22 11:24:13

So this works:

temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader();

but this does not:

temp = LOAD '/user/training/test/{20100810..20100812}' USING SomeLoader();

If you want a date range that spans, say, 300 days, passing the full list to LOAD is inelegant, to say the least. I came up with this, and it works.

Say you want to load data from 2012-10-08 to today, 2013-02-14. You can do this:

temp = LOAD '/user/training/test/{201210*,201211*,201212*,2013*}' USING SomeLoader();

and then filter afterwards:

filtered = FILTER temp BY (the_date >= '2012-10-08');
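
As a rough sketch of how such month-level wildcards could be generated rather than typed by hand, the Python below builds one `YYYYMM*` pattern per month touched by the range. The function name `month_globs` is hypothetical, and unlike the hand-written pattern above it lists each month of 2013 separately instead of using `2013*`. The result still covers whole months, so the FILTER step remains necessary.

```python
from datetime import datetime


def month_globs(start: str, end: str, fmt: str = "%Y%m%d") -> str:
    """Return '{YYYYMM*,...}' with one wildcard per month between start and end.

    Hypothetical helper: assumes directory names begin with YYYYMM.
    """
    first = datetime.strptime(start, fmt)
    last = datetime.strptime(end, fmt)
    if last < first:
        raise ValueError("end date is before start date")
    patterns = []
    year, month = first.year, first.month
    # Walk month by month until we pass the month containing the end date.
    while (year, month) <= (last.year, last.month):
        patterns.append(f"{year:04d}{month:02d}*")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return "{" + ",".join(patterns) + "}"


if __name__ == "__main__":
    print("/user/training/test/" + month_globs("20121008", "20130214"))
```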
清泪尽 2024-09-22 11:24:13
temp = LOAD '/user/training/test/2010081*/*' USING SomeLoader() AS (...);
-- loads the data for 20100810 through 20100819
temp = LOAD '/user/training/test/2010081{0,1,2}/*' USING SomeLoader() AS (...);
-- loads the data for 20100810 through 20100812

If the variable part is in the middle of the file path, concatenate the subfolder name or use '*' to match all files.

夏天碎花小短裙 2024-09-22 11:24:13

I found that this problem is caused by the Linux shell. The shell expands

 {20100810..20100812}

to

  20100810 20100811 20100812,

so the command you actually run is

bin/hadoop fs -ls 20100810 20100811 20100812

The HDFS API, however, does not expand the expression for you.
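
To make the shell's behaviour concrete, here is a small Python imitation of the numeric `{a..b}` expansion. It is a simplified sketch (the name `expand_numeric_range` is hypothetical, and it ignores zero-padding and nested braces), but it shows that the range is a plain integer sequence, not a calendar range.

```python
import re
from typing import List


def expand_numeric_range(word: str) -> List[str]:
    """Imitate the shell's integer {a..b} expansion for the first range in word.

    Simplified sketch: handles one non-negative integer range only.
    """
    match = re.search(r"\{(\d+)\.\.(\d+)\}", word)
    if match is None:
        return [word]  # nothing to expand
    low, high = int(match.group(1)), int(match.group(2))
    step = 1 if low <= high else -1
    prefix, suffix = word[:match.start()], word[match.end():]
    return [prefix + str(n) + suffix for n in range(low, high + step, step)]


if __name__ == "__main__":
    for path in expand_numeric_range("/user/training/test/{20100810..20100813}"):
        print(path)
    # Across a month boundary the integer sequence includes non-dates
    # such as 20100832, which is one reason to generate real dates instead.
    print(len(expand_numeric_range("{20100830..20100902}")))
```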

客…行舟 2024-09-22 11:24:13

Thanks to dave campbell.
Some of the answers above are wrong even though they received votes.

The following are my test results:

  • Works

    • pig -f test.pig -param input="/test_{20120713,20120714}.txt"
      • There must be no space before or after "," in the expression
    • pig -f test.pig -param input="/test_201207*.txt"
    • pig -f test.pig -param input="/test_2012071?.txt"
    • pig -f test.pig -param input="/test_20120713.txt,/test_20120714.txt"
    • pig -f test.pig -param input=/test_20120713.txt,/test_20120714.txt
      • There must be no space before or after "," in the expression
  • Doesn't Work

    • pig -f test.pig -param input="/test_{20120713..20120714}.txt"
    • pig -f test.pig -param input=/test_{20120713,20120714}.txt
    • pig -f test.pig -param input=/test_{20120713..20120714}.txt
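
One way to sidestep the calling shell's quoting rules is to launch Pig from a wrapper with an argument list instead of a command string. The sketch below only builds that list; `build_pig_command` is a hypothetical helper, it uses the `-f` and `-param` options shown in the tests above, and it assumes a `pig` executable on the PATH. It has not been tested against a real Pig installation, and the `pig` launcher itself may still process its arguments.

```python
import subprocess
from typing import Dict, List


def build_pig_command(script: str, params: Dict[str, str]) -> List[str]:
    """Build an argv list for pig; each parameter becomes '-param name=value'.

    Hypothetical helper: with an argv list no shell parses the values,
    so braces and wildcards reach the pig launcher unchanged.
    """
    cmd = ["pig", "-f", script]
    for name, value in params.items():
        cmd += ["-param", f"{name}={value}"]
    return cmd


if __name__ == "__main__":
    cmd = build_pig_command("test.pig", {"input": "/test_{20120713,20120714}.txt"})
    print(cmd)
    # Uncomment on a machine where pig is installed:
    # subprocess.run(cmd, check=True)
```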
谈下烟灰 2024-09-22 11:24:13

Do I need to make use of a higher language like Python to capture all date stamps in the range and pass them to LOAD as a comma separated list?

Probably not. This can be done with a custom Load UDF, or you could rethink your directory structure (this works well if your ranges are mostly static).

Additionally, Pig accepts parameters, which may help you (perhaps you could write a function that loads data for one day and unions it into the result set, but I don't know whether that's possible).

Edit: writing a simple Python or bash script that generates the list of dates (folders) is probably the easiest solution. You then just have to pass it to Pig, and this should work fine.

幸福还没到 2024-09-22 11:24:13

To add to Romain's answer: if you only want to parameterize the dates, run the shell command like this:

pig -param input="$(echo {20100810..20100812} | tr ' ' ,)" -f script.pig

pig:

temp = LOAD '/user/training/test/{$input}' USING SomeLoader() AS (...);

Please note the quotes.

蒗幽 2024-09-22 11:24:13

Pig supports HDFS's globStatus,

so I think Pig can handle the pattern
/user/training/test/{20100810,20100811,20100812}.

Could you paste the error logs?

她如夕阳 2024-09-22 11:24:13

Here's a script I use to generate a list of dates and then pass that list to the Pig script as a parameter. It's quite tricky, but it works for me.

For example:

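# DAYS must be set before the loop: the number of days after DT to include.
DT=20180101
DT_LIST=''
for ((i=0; i<=DAYS; i++))
do
    # GNU date: add i days to the start date
    d=$(date +%Y%m%d -d "${DT} +$i days")
    DT_LIST=${DT_LIST}$d','
done

# Strip the trailing comma
DT_LIST=${DT_LIST%,}

# Double quotes let the shell expand ${DT_LIST} before Pig sees the parameter
pig -p input_data="xxx/yyy/${DT_LIST}" script.pig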
