Generating multiple outputs with Hadoop Pig
I have a file in Hadoop containing a list of data. I've built a simple Pig script that analyzes the file by id number, and so on.
The last step I'm looking for is this: I'd like to create (store) a file for each unique id number. So this should depend on a group step. However, I haven't understood whether this is possible (maybe there is a custom store module?).
Any ideas?
Thanks
Daniele
Comments (2)
Keeping in mind what frail said, MultiStorage in PiggyBank seems to be what you are looking for.
To get an output (a file or anything else) you need to assign data to a variable; that's how it works with STORE. If the ids are limited and finite, you can FILTER them one by one and then STORE them. (I always do that for action types, of which there are about 20-25.) But if you badly need a file for each unique id, then make 2 files: one with the whole data grouped by id, and one with just the unique ids. Then try generating 1 (or more, if you have too many ids) Pig scripts that FILTER BY each id. But it's a bad solution: assuming you group 10 ids per Pig script, you would have (unique id count / 10) Pig scripts to run.
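The script-generation idea could be sketched roughly like this in Python. Everything here is an assumption for illustration: the function names, the tab-separated (id, value) schema, the paths, and the requirement that ids are plain strings without quotes. It only builds the Pig Latin text; it does not run Pig.

```python
from typing import Iterable, List


def chunk(ids: List[str], size: int) -> List[List[str]]:
    """Split the id list into consecutive batches of at most `size` ids."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def build_pig_script(ids: Iterable[str], input_path: str, output_dir: str) -> str:
    """Return Pig Latin text with one FILTER and one STORE per id."""
    lines = [
        # Hypothetical schema: tab-separated (id, value); adjust to the real data.
        f"data = LOAD '{input_path}' USING PigStorage('\\t') "
        "AS (id:chararray, value:chararray);",
    ]
    for n, id_value in enumerate(ids):
        # Assumes the id contains no quote characters.
        lines.append(f"subset_{n} = FILTER data BY id == '{id_value}';")
        lines.append(f"STORE subset_{n} INTO '{output_dir}/{id_value}';")
    return "\n".join(lines) + "\n"


def generate_scripts(
    unique_ids: Iterable[str],
    input_path: str,
    output_dir: str,
    ids_per_script: int = 10,
) -> List[str]:
    """Build one Pig script per batch of ids (10 ids per script by default)."""
    batches = chunk(list(unique_ids), ids_per_script)
    return [build_pig_script(batch, input_path, output_dir) for batch in batches]
```

With 25 unique ids and the default batch size this yields three scripts, the last one covering the remaining five ids. Each returned string can be written to a .pig file and submitted separately.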
Beware that HDFS isn't good at handling too many small files.
Edit:
A better solution would be to GROUP and SORT (ORDER BY in Pig) by unique id into one big file. Then, since it's sorted, you can easily divide the contents with a third-party script.
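As a rough sketch of such a splitting script, assuming the sorted output has been copied to the local filesystem (for example with hadoop fs -getmerge), that each line is tab-separated, and that the id is the first field. The function name and the one-file-per-id naming scheme are made up for illustration.

```python
import itertools
from pathlib import Path
from typing import Iterable, List


def split_sorted_by_id(
    lines: Iterable[str], out_dir: str, delimiter: str = "\t"
) -> List[Path]:
    """Write each run of consecutive lines sharing an id to its own file.

    Relies on the input being sorted by id: every id is then one
    contiguous run, so the data is streamed once and never held in memory.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    def id_of(line: str) -> str:
        # The id is assumed to be the first delimited field of the line.
        return line.rstrip("\n").split(delimiter, 1)[0]

    written = []
    for id_value, group in itertools.groupby(lines, key=id_of):
        path = target / f"{id_value}.txt"
        with path.open("w", encoding="utf-8") as handle:
            handle.writelines(group)
        written.append(path)
    return written
```

A file object can be passed directly as `lines`, e.g. `split_sorted_by_id(open("sorted.tsv", encoding="utf-8"), "by_id")`. If the input is not actually sorted, a repeated id would overwrite its earlier file, so the sort step matters.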