Posting Hadoop Pig output as JSON data to a URL?

Posted 2024-11-17 07:11:53

I have a Pig job which analyzes log files and writes summary output to S3. Instead of writing the output to S3, I want to convert it to a JSON payload and POST it to a URL.

Some notes:

  • This job is running on Amazon Elastic MapReduce.
  • I can use a STREAM to pipe the data through an external command, and load it from there. But because Pig never sends an EOF to external commands, this means I need to POST each row as it arrives, and I can't batch them. Obviously, this hurts performance.

What's the best way to address this problem? Is there something in PiggyBank or another library that I can use? Or should I write a new storage adapter? Thank you for your advice!

Comments (3)

未央 2024-11-24 07:11:53

Rather than streaming, you could write a UDF (since UDFs do provide a finish() callback) [1].

Another approach could be to do the POST as a second pass over the data.

  1. your existing Pig step, which just writes out a single relation as JSON strings
  2. a simple streaming job using NLineInputFormat to do the POSTs in batches

I always favor this style of approach since it separates the concerns and keeps the Pig code clean.

It also gives you (in my mind) simpler tuning options on the POST portion of the job. In this case it's (probably) important to turn off speculative execution, depending on the idempotence of your receiving web service. Beware that a cluster running lots of concurrent jobs can totally kill a server too :D

e.g. for posting in batches of 20...

$ hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
  -D mapred.line.input.format.linespermap=20 \
  -D mapred.reduce.tasks.speculative.execution=false \
  -input json_data_to_be_posted -output output \
  -mapper your_posting_script_here.sh \
  -numReduceTasks 0 \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat
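
For concreteness, the mapper placeholder (your_posting_script_here.sh) could be a small shell script along these lines. This is only a sketch under assumptions of mine: the endpoint URL is made up, each input line is assumed to already be a JSON object, and depending on how streaming pairs with NLineInputFormat a byte-offset key may be prefixed to each line and would need stripping.

#!/bin/bash
# Hypothetical posting mapper: each map task gets its ~20 JSON lines on stdin
# and sends them as a single request. ENDPOINT is an assumption.
set -e
ENDPOINT="http://example.com/ingest"

# Join this task's input lines into one JSON array payload.
# (If streaming prepends a byte-offset key and a tab to each line in your
#  setup, insert `cut -f2- |` before paste to strip it.)
payload="[$(paste -sd, -)]"

# POST the batch; curl's response body would become map output, so discard it.
curl --silent --show-error --output /dev/null \
  --header "Content-Type: application/json" \
  --data "$payload" \
  "$ENDPOINT"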

[1] http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/EvalFunc.html#finish%28%29

夏花。依旧 2024-11-24 07:11:53

Perhaps you should handle the posting of the data outside of Pig. I find that wrapping my Pig in bash is usually easier than writing some UDF for a post (no pun intended) processing step. If you never want it hitting S3, you can use DUMP instead of STORE and handle the standard output to be POSTed. Otherwise, store it in S3, pull it out with hadoop fs -cat outputpath/part*, then send it out with curl or something.
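
A minimal sketch of that wrapper, with the script name, output path, and endpoint URL all made up for illustration (and assuming the Pig job already writes JSON strings):

#!/bin/bash
# Hypothetical wrapper: run the Pig job, pull the stored output back, POST it.
set -e

pig -f summarize_logs.pig

# Concatenate the part files Pig stored (adjust the S3/HDFS path to yours).
hadoop fs -cat s3://my-bucket/pig-output/part-* > summary.json

# --data-binary keeps the newlines in the payload intact.
curl --silent --show-error \
  --header "Content-Type: application/json" \
  --data-binary @summary.json \
  "http://example.com/ingest"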

宁愿没拥抱 2024-11-24 07:11:53

As it turns out, Pig does correctly send EOF to external commands, so you do have the option of streaming everything through an external script. If it isn't working, then you probably have a hard-to-debug configuration problem.

Here's how to get started. Define an external command as follows, using whatever interpreter and script you need:

DEFINE UPLOAD_RESULTS `env GEM_PATH=/usr/lib/ruby/gems/1.9.0 ruby1.9 /home/hadoop/upload_results.rb`;

Stream the results through your script:

/* Write our results to our Ruby script for uploading.  We add
   a trailing bogus DUMP to make sure something actually gets run. */
empty = STREAM results THROUGH UPLOAD_RESULTS;
DUMP empty;

From Ruby, you can batch the input records into blocks of 1024:

STDIN.each_line.each_slice(1024) do |chunk|
  # 'chunk' is an array of 1024 lines, each consisting of tab-separated
  # fields followed by a newline. 
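  # (Hypothetical next step: build one JSON payload from 'chunk' and POST it,
  #  e.g. with Ruby's Net::HTTP, so each request carries up to 1024 records.)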
end

If this fails to work, check the following carefully:

  1. Does your script work from the command line?
  2. When run from Pig, does your script have all the necessary environment variables?
  3. Are your EC2 bootstrap actions working correctly?

Some of these are hard to verify, but if any of them are failing, you can easily waste quite a lot of time debugging.
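
For the first checklist item (and the interpreter/GEM_PATH part of the second), one quick check is to run the exact command from the DEFINE line by hand against a fake row; the tab-separated sample values below are made up:

$ printf 'some_key\t42\t2011-05-01\n' | \
    env GEM_PATH=/usr/lib/ruby/gems/1.9.0 ruby1.9 /home/hadoop/upload_results.rb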

Note, however, that you should strongly consider the alternative approaches recommended by mat kelcey.
