How can Pig use Hadoop globs in a LOAD statement?

Posted 2024-11-02 19:13:10

As I've noted previously, Pig doesn't cope well with empty (0-byte) files. Unfortunately, there are lots of ways that these files can be created (even within Hadoop utilities).

I thought that I could work around this problem by explicitly loading only files that match a given naming convention in the LOAD statement using Hadoop's glob syntax. Unfortunately, this doesn't seem to work, as even when I use a glob to filter down to known-good input files, I still run into the 0-byte failure mentioned earlier.

Here's an example: Assume I have the following files in S3:

mybucket/a/b/ (0 bytes)
mybucket/a/b/myfile.log (>0 bytes)
mybucket/a/b/yourfile.log (>0 bytes)

If I use a LOAD statement like this in my pig script:

myData = load 's3://mybucket/a/b/*.log' as ( ... );

I would expect that Pig would not choke on the 0-byte file, but it still does. Is there a trick to getting Pig to actually only look at files that match the expected glob pattern?

阳光的暖冬 2024-11-09 19:13:10

This is a fairly ugly solution, but globs that don't rely on the * wildcard syntax appear to work. So, in our workflow (before calling our pig script), we list all of the files below the prefix we're interested in, and then create a specific glob that consists of only the paths we want to load.

For the example above, we list "mybucket/a":

hadoop fs -lsr s3://mybucket/a

This returns a list of files, plus other metadata. We can then create the glob from that data:

myData = load 's3://mybucket/a/b{/myfile.log,/yourfile.log}' as ( ... )

This requires a bit more front-end work, but allows us to specifically target the files we're interested in and avoid 0-byte files.
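The front-end step described above could be scripted roughly as follows. This is only a sketch under some assumptions: the listing is taken to be the usual eight-column output (permissions, replication, owner, group, size, date, time, path), the paths are assumed to be printed as full s3:// URIs, and the sample listing and the helper names `nonempty_files` and `build_brace_glob` are made up for illustration. The real output on S3 may differ, so the column handling may need adjusting.

```python
# Hypothetical listing, standing in for the output of `hadoop fs -lsr s3://mybucket/a`.
# Owner, group, sizes and timestamps are invented for the example.
SAMPLE_LISTING = """\
Found 3 items
-rw-r--r--   1 hadoop supergroup          0 2011-05-01 12:00 s3://mybucket/a/b
-rw-r--r--   1 hadoop supergroup       2048 2011-05-01 12:00 s3://mybucket/a/b/myfile.log
-rw-r--r--   1 hadoop supergroup       4096 2011-05-01 12:01 s3://mybucket/a/b/yourfile.log
"""


def nonempty_files(listing_lines, suffix=".log"):
    """Yield paths of non-empty regular files whose names end with `suffix`.

    Assumes eight whitespace-separated columns with the size in column 5
    and the path in the last column.
    """
    for line in listing_lines:
        fields = line.split(None, 7)
        if len(fields) != 8:
            continue  # skip headers such as "Found N items" and blank lines
        perms, size, path = fields[0], fields[4], fields[7].strip()
        if perms.startswith("d") or not size.isdigit():
            continue  # skip directories and lines that do not parse
        if int(size) > 0 and path.endswith(suffix):
            yield path


def build_brace_glob(prefix, paths):
    """Build a brace glob of the form prefix{/a,/b} from full paths.

    Paths containing ',' '{' or '}' are not escaped here and would
    produce a broken pattern.
    """
    suffixes = []
    for path in paths:
        if not path.startswith(prefix):
            raise ValueError("path %r is not under prefix %r" % (path, prefix))
        suffixes.append(path[len(prefix):])
    if not suffixes:
        raise ValueError("no input files left after filtering")
    return prefix + "{" + ",".join(suffixes) + "}"


if __name__ == "__main__":
    paths = list(nonempty_files(SAMPLE_LISTING.splitlines()))
    # prints: s3://mybucket/a/b{/myfile.log,/yourfile.log}
    print(build_brace_glob("s3://mybucket/a/b", paths))
```

The resulting string can then be handed to the script through Pig's parameter substitution, for example `pig -param INPUT='...' myscript.pig` with `load '$INPUT'` in the script (the script name and the parameter name INPUT are placeholders).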

Update: Unfortunately, I've found that this solution fails when the glob pattern gets long; Pig ends up throwing an exception "Unable to create input slice".
