Why does the elephant-bird Pig JsonLoader only process part of my file?
I'm using Pig on Amazon's Elastic MapReduce to do batch analytics. My input files are on S3 and contain events represented as one JSON dictionary per line. I use the elephant-bird JsonLoader library to parse the input files. So far so good.
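(For reference, the load step looks roughly like the sketch below; the jar names and the S3 path are placeholders, not the exact script.)

-- register the elephant-bird jars (names are illustrative; match the versions you have installed)
REGISTER 'elephant-bird-core.jar';
REGISTER 'elephant-bird-pig.jar';
REGISTER 'json-simple.jar';

-- each input line is one JSON dictionary; JsonLoader exposes it as a Pig map
events = LOAD 's3://my-bucket/events/*' USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json: map[]);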
I'm running into problems processing a large file stored on the local filesystem or HDFS in an interactive Pig session. It looks like if the input file is large enough to get split, only one of the splits is ever processed by elephant-bird, and processing stops at the end of that split with no error message. I don't have the same problem if I stream the input from S3 (no file splitting on S3 input), or if I convert the file to a format Pig can read directly.
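(A quick sanity check along those lines, sketched here with a placeholder local path, is to count the raw lines with Pig's built-in TextLoader and compare against what the JSON loader reports:)

-- load the same file as plain text, one chararray per line
raw = LOAD '/tmp/events.json' USING TextLoader() AS (line: chararray);

-- count every line the loader emits
g = GROUP raw ALL;
total = FOREACH g GENERATE COUNT(raw);
DUMP total;   -- expected to print (833138) for the problem file described below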
For a concrete example: a file with 833,138 lines is only processed up to 379,751 lines (and if I watch the completion percentage in Pig, it goes smoothly up to 50% and then jumps to 100%). I also tried a file with 400,000 lines and it was processed fine.
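(One way to test the split hypothesis, offered here only as an untested sketch, is to force the whole file into a single split by raising Hadoop's minimum split size before the LOAD; the 1 GB value is arbitrary, just anything larger than the file.)

-- ask Hadoop to compute splits of at least 1 GB so the file lands in a single split
set mapred.min.split.size '1073741824';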
So my question is: why is only one split processed by elephantbird? Am I misunderstanding how Pig in interactive mode is supposed to work or is there something wildly wrong going on?
Comments (1)
Katia, you'll get help much faster if you email the Pig user list :).
Please try Pig 0.8.1 (the current release) and let us know if you still get errors. For what it's worth, I've been using the EB JSON loader on hundred-gig files for over a year and they process fine, so perhaps there's something about your data.
Spike Gronim -- that's been fixed, local mode is now mostly identical (except for things like distributed cache and skewed joins) to non-local mode. Upgrade.