Pig: changing the format of the output file name
I am running an Elastic MapReduce pipeline that uses the output from multiple Pig scripts. Essentially, the output of each Pig script is stored at a certain location on S3, and since the data is huge, the output is split across many files named part-xxxxx.
My problem is that one of the steps in the pipeline copies the contents of two different locations into one place and then processes the combined collection. Since the files in both locations have the same names (part-00000 through part-00342), my files get overwritten during the copy.
By default, Pig generates output files at the given location with this filename format. Initially I used to download the Pig output files to my disk, write a Python program to rename them, and upload them back to S3. I cannot do that any more because of the sheer amount of data.
I do not own the pipeline step that actually does this copying. All I (perhaps) have control over is the names of the files being copied. So I need to know whether there is a way to attach a prefix to the names of the part- files created by Pig.
Thanks
2 Answers
I'm not sure you can change the prefix in Pig.
Even though you've said you don't have control over it, I definitely think it'd be best to make the downstream process take two input directories. It sounds really inefficient to have to copy the two directories into one just for the next step.
If you really have to, though, you can do the rename itself with a Hadoop streaming job where the streaming command is a 'hadoop fs -cp'. Let me know if you haven't seen this approach and I can write it up as a blog post; I've been meaning to anyway...
Mat
You can change it somewhat using:
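A minimal sketch of one option, assuming a Hadoop 2.x cluster where FileOutputFormat reads the mapreduce.output.basename property as the part-file prefix; the 'datasetA' prefix and the S3 paths below are illustrative assumptions, not something this answer confirms:

    -- Hypothetical Pig snippet: give this script's part files a different prefix.
    -- Assumes the underlying Hadoop version honors mapreduce.output.basename.
    SET mapreduce.output.basename 'datasetA';

    raw = LOAD 's3://my-bucket/input/' USING PigStorage('\t');
    -- ... existing processing ...
    STORE raw INTO 's3://my-bucket/output/datasetA' USING PigStorage('\t');

If this works on your cluster, the files come out named something like datasetA-r-00000 instead of part-r-00000 (the exact suffix depends on the Hadoop version), so giving each of the two Pig scripts a different basename would keep the copied files from colliding.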