How does Pig use Hadoop globs in a LOAD statement?
As I've noted previously, Pig doesn't cope well with empty (0-byte) files. Unfortunately, there are lots of ways that these files can be created (even within Hadoop utilities).
I thought that I could work around this problem by explicitly loading only files that match a given naming convention in the LOAD statement using Hadoop's glob syntax. Unfortunately, this doesn't seem to work, as even when I use a glob to filter down to known-good input files, I still run into the 0-byte failure mentioned earlier.
Here's an example: Assume I have the following files in S3:
- mybucket/a/b/ (0 bytes)
- mybucket/a/b/myfile.log (>0 bytes)
- mybucket/a/b/yourfile.log (>0 bytes)
If I use a LOAD statement like this in my pig script:
myData = load 's3://mybucket/a/b/*.log' as ( ... );
I would expect that Pig would not choke on the 0-byte file, but it still does. Is there a trick to getting Pig to actually only look at files that match the expected glob pattern?
Comments (1)
This is a fairly ugly solution, but globs that don't rely on the * wildcard syntax appear to work. So, in our workflow (before calling our Pig script), we list all of the files below the prefix we're interested in, and then create a specific glob that consists of only the paths we want. For example, in the case above, we list "mybucket/a", which returns a list of files plus other metadata. We can then create the glob from that data.
This requires a bit more front-end work, but allows us to specifically target the files we're interested in and avoid 0-byte files.
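To make the glob-building step concrete, here is a rough Python sketch of the idea. It is not the code we actually run. It assumes you already have the listing as (path, size) pairs, that all the files you want sit in one directory, and that an explicit {name1,name2} alternation is what you want in place of *. The function name build_glob, the suffix parameter, and the sizes in the example are made up for illustration.

```python
def build_glob(entries, suffix=".log"):
    """Build a wildcard-free glob from a file listing.

    entries: iterable of (path, size_in_bytes) pairs, e.g. parsed from a
    directory listing (hypothetical input format).
    Returns a string like 'dir/{a.log,b.log}' naming only non-empty files.
    """
    # Keep only non-empty files with the expected suffix; this drops
    # 0-byte entries such as the 'mybucket/a/b/' placeholder.
    keep = sorted(
        path for path, size in entries
        if size > 0 and path.endswith(suffix)
    )
    if not keep:
        raise ValueError("no non-empty files matched")

    # This sketch assumes every kept path contains a '/' and that all of
    # them share one parent directory.
    dirs = {path.rsplit("/", 1)[0] for path in keep}
    if len(dirs) != 1:
        raise ValueError("sketch assumes all files share one directory")

    names = [path.rsplit("/", 1)[1] for path in keep]
    return dirs.pop() + "/{" + ",".join(names) + "}"


# Example listing matching the question (sizes are invented).
listing = [
    ("s3://mybucket/a/b/", 0),
    ("s3://mybucket/a/b/myfile.log", 120),
    ("s3://mybucket/a/b/yourfile.log", 45),
]
glob_path = build_glob(listing)
```

The resulting string is what would go in the LOAD path in place of the *.log pattern.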
Update: Unfortunately, I've found that this solution fails when the glob pattern gets long; Pig ends up throwing an exception "Unable to create input slice".