How do I handle empty or missing input files in Apache Pig?
Our workflow uses an AWS Elastic MapReduce cluster to run a series of Pig jobs that manipulate a large amount of data into aggregated reports. Unfortunately, the input data is potentially inconsistent, and can result in either no input files or 0-byte files being given to the pipeline, or even being produced by some stages of the pipeline.
During a LOAD statement, Pig fails spectacularly if it either finds no input files or any of the input files are 0 bytes.
Is there any good way to work around this (hopefully within the Pig configuration or script or the Hadoop cluster configuration, without writing a custom loader...)?
(Since we're using AWS Elastic MapReduce, we're stuck with Pig 0.6.0 and Hadoop 0.20.)
2 Answers
(For posterity, a sub-par solution we've come up with:)
To deal with the 0-byte problem, we've found that we can detect the situation and instead insert a file containing a single newline. This still causes Pig to log a warning about the malformed record, but at least it doesn't crash with an exception.
Alternatively, we could produce a line with the appropriate number of '\t' characters for that file, which would avoid the warning, but it would insert garbage into the data that we would then have to filter out. These same ideas can be used to handle the no-input-files condition by creating a dummy file, but that has the same downsides listed above.
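The padding idea above can be sketched as a small pre-step in shell. This assumes inputs are staged on a local filesystem; for S3 or HDFS you'd use the equivalent `hadoop fs` checks instead. The helper name `pad_empty_inputs` is illustrative, not part of the actual pipeline:

```shell
#!/bin/sh
# Sketch: before running the Pig job, replace any 0-byte input file with a
# single-newline placeholder so LOAD has something to read.

pad_empty_inputs() {
  dir=$1
  found=0
  for f in "$dir"/*; do
    [ -e "$f" ] || continue       # unexpanded glob: directory is empty
    found=1
    # [ -s ] is true only for a non-empty file; pad 0-byte files
    [ -s "$f" ] || printf '\n' > "$f"
  done
  # No input files at all: drop in a dummy file (same caveat as above:
  # the newline record must be filtered out downstream).
  [ "$found" -eq 1 ] || printf '\n' > "$dir/placeholder"
}
```

You would call `pad_empty_inputs /path/to/job/input` from the driver script immediately before invoking `pig`.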
The approach I've been using is to run Pig scripts from a shell script. I have one job that gets data from six different input directories, so I've written a Pig fragment for each input.
The shell script checks for the existence of each input and assembles a final Pig script from the fragments that apply.
It then executes the final Pig script. I know it's a bit of a Rube Goldberg approach, but so far so good. :-)
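The fragment-assembly step described above might look roughly like this. The function name and the `input_path:fragment_file` argument convention are assumptions for the sketch, and the existence checks are local-filesystem ones; an EMR setup would check S3/HDFS paths instead:

```shell
#!/bin/sh
# Sketch: build a final Pig script by concatenating only those fragments
# whose corresponding input actually exists and is non-empty.

assemble_pig_script() {
  out=$1
  shift
  : > "$out"                       # start with an empty final script
  for pair in "$@"; do             # each argument: "input_path:fragment_file"
    input=${pair%%:*}
    fragment=${pair#*:}
    # include the fragment if the input is a non-empty file or a directory
    if [ -s "$input" ] || [ -d "$input" ]; then
      cat "$fragment" >> "$out"
    fi
  done
}
```

Usage would be along the lines of `assemble_pig_script final.pig /data/a:frag_a.pig /data/b:frag_b.pig` followed by `pig final.pig`.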