Posting Hadoop Pig output as JSON data to a URL?
I have a Pig job which analyzes log files and writes summary output to S3. Instead of writing the output to S3, I want to convert it to a JSON payload and POST it to a URL.
Some notes:
- This job is running on Amazon Elastic MapReduce.
- I can use a STREAM to pipe the data through an external command, and load it from there. But because Pig never sends an EOF to external commands, this means I need to POST each row as it arrives, and I can't batch them. Obviously, this hurts performance.
What's the best way to address this problem? Is there something in PiggyBank or another library that I can use? Or should I write a new storage adapter? Thank you for your advice!
Comments (3)
Rather than streaming, you could write a UDF (since UDFs do provide a finish() callback) [1].
Another approach could be to do the POST as a second pass over the data.
I always favor this style of approach since it separates the concerns and keeps the Pig code clean.
It also gives you (in my mind) simpler tuning options on the POST portion of your job. In this case it's (probably) important to turn off speculative execution, depending on the idempotence of your receiving webservice. Beware that a cluster running lots of concurrent jobs can totally kill a server too :D
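In Pig, turning off speculative execution is just a couple of SET statements; a sketch, using the standard Hadoop property names:

    -- Disable speculative execution so each output record is POSTed at most once.
    SET mapred.map.tasks.speculative.execution false;
    SET mapred.reduce.tasks.speculative.execution false;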
e.g. a UDF for posting in batches of 20...
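A rough sketch of such a UDF, assuming a hypothetical PostToUrl class whose postBatch() helper stands in for your own JSON/HTTP code:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class PostToUrl extends EvalFunc<Boolean> {
        private final List<Tuple> batch = new ArrayList<Tuple>();

        @Override
        public Boolean exec(Tuple input) throws IOException {
            // Buffer incoming rows and POST them 20 at a time.
            batch.add(input);
            if (batch.size() >= 20) {
                postBatch(batch);
                batch.clear();
            }
            return Boolean.TRUE;
        }

        @Override
        public void finish() {
            // Called once all tuples have been processed; flush any leftovers.
            if (!batch.isEmpty()) {
                postBatch(batch);
                batch.clear();
            }
        }

        // Placeholder: serialize 'rows' to a JSON payload and POST it to your URL.
        private void postBatch(List<Tuple> rows) {
        }
    }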
[1] http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/EvalFunc.html#finish%28%29
Perhaps you should handle the posting of the data outside of Pig. I find that wrapping my Pig in bash is usually easier than writing a UDF for a post (no pun intended) processing step. If you never want it hitting S3, you can use dump instead of store and handle the standard output that is to be posted. Otherwise, store it in S3, pull it out with hadoop fs -cat outputpath/part*, then send it out with curl or something.
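For example, the bash wrapper might boil down to something like this (the script name and URL are placeholders, and it assumes the job already writes JSON-formatted rows):

    # Run the Pig job, then pull the summary out of S3/HDFS and POST it.
    pig -f summarize_logs.pig
    hadoop fs -cat outputpath/part* \
      | curl -X POST -H 'Content-Type: application/json' --data-binary @- \
             http://example.com/ingest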
As it turns out, Pig does correctly send EOF to external commands, so you do have the option of streaming everything through an external script. If it isn't working, then you probably have a hard-to-debug configuration problem.
Here's how to get started. Define an external command as follows, using whatever interpreter and script you need:
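Something along these lines, for instance (the alias, the Ruby interpreter, and the script path are only placeholders):

    -- SHIP() copies the local script out to every task node, so the command
    -- can find it in its working directory.
    DEFINE upload_results `ruby upload_results.rb` SHIP('/home/hadoop/upload_results.rb');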
Stream the results through your script:
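For example (the relation names are placeholders):

    -- 'summary' holds the rows you want to POST; every row is piped to the script.
    posted = STREAM summary THROUGH upload_results;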
From Ruby, you can batch the input records into blocks of 1024:
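A minimal sketch of that script, assuming tab-separated input rows and a placeholder endpoint:

    #!/usr/bin/env ruby
    require 'net/http'
    require 'json'

    ENDPOINT = URI('http://example.com/ingest')  # placeholder URL

    # Pig streams each output row to us on STDIN as a tab-separated line.
    STDIN.each_line.each_slice(1024) do |batch|
      rows = batch.map { |line| line.chomp.split("\t") }
      Net::HTTP.start(ENDPOINT.host, ENDPOINT.port) do |http|
        request = Net::HTTP::Post.new(ENDPOINT.path, 'Content-Type' => 'application/json')
        request.body = JSON.generate(rows)
        http.request(request)
      end
    end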
If this fails to work, check your script, interpreter, and streaming configuration carefully. Some of these details are hard to verify, but if any of them are wrong, you can easily waste quite a lot of time debugging.
Note, however, that you should strongly consider the alternative approaches recommended by mat kelcey.