Posting Hadoop Pig output as JSON data to a URL?

Posted 2024-11-17 07:11:53

I have a Pig job which analyzes log files and writes summary output to S3. Instead of writing the output to S3, I want to convert it to a JSON payload and POST it to a URL.

Some notes:

  • This job is running on Amazon Elastic MapReduce.
  • I can use a STREAM to pipe the data through an external command, and load it from there. But because Pig never sends an EOF to external commands, this means I need to POST each row as it arrives, and I can't batch them. Obviously, this hurts performance.

What's the best way to address this problem? Is there something in PiggyBank or another library that I can use? Or should I write a new storage adapter? Thank you for your advice!

Comments (3)

未央 2024-11-24 07:11:53

Rather than streaming, you could write a UDF (since UDFs do provide a finish() callback) [1].

Another approach could be to do the POST as a second pass over the data.

  1. your existing Pig step, which just writes out a single relation as JSON strings
  2. a simple streaming job using NLineInputFormat to do the POSTs in batches

I always favor this style of approach since it separates the concerns and keeps the Pig code clean.

It also gives you (in my mind) simpler tuning options on the POST portion of the job. In this case it's (probably) important to turn off speculative execution, depending on the idempotence of your receiving web service. Beware that a cluster running lots of concurrent jobs can totally kill a server too :D

e.g. for posting in batches of 20...

$ hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
  -D mapred.line.input.format.linespermap=20 \
  -D mapred.reduce.tasks.speculative.execution=false \
  -input json_data_to_be_posted -output output \
  -mapper your_posting_script_here.sh \
  -numReduceTasks 0 \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat
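
For concreteness, the mapper placeholder (your_posting_script_here.sh) could be a small shell script along these lines. This is only a sketch under assumptions of mine: the endpoint URL is made up, each input line is assumed to already be a JSON object, and depending on how streaming pairs with NLineInputFormat a byte-offset key may be prefixed to each line and would need stripping.

#!/bin/bash
# Hypothetical posting mapper: each map task gets its ~20 JSON lines on stdin
# and sends them as a single request. ENDPOINT is an assumption.
set -e
ENDPOINT="http://example.com/ingest"

# Join this task's input lines into one JSON array payload.
# (If streaming prepends a byte-offset key and a tab to each line in your
#  setup, insert `cut -f2- |` before paste to strip it.)
payload="[$(paste -sd, -)]"

# POST the batch; curl's response body would become map output, so discard it.
curl --silent --show-error --output /dev/null \
  --header "Content-Type: application/json" \
  --data "$payload" \
  "$ENDPOINT"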

[1] http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/EvalFunc.html#finish%28%29

夏花。依旧 2024-11-24 07:11:53

Perhaps you should handle the posting of the data outside of Pig. I find that wrapping my Pig in bash is usually easier than writing some UDF for a post (no pun intended) processing step. If you never want it hitting S3, you can use DUMP instead of STORE and handle the standard output to be POSTed. Otherwise, store it in S3, pull it out with hadoop fs -cat outputpath/part*, then send it out with curl or something.
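
A minimal sketch of that wrapper, with the script name, output path, and endpoint URL all made up for illustration (and assuming the Pig job already writes JSON strings):

#!/bin/bash
# Hypothetical wrapper: run the Pig job, pull the stored output back, POST it.
set -e

pig -f summarize_logs.pig

# Concatenate the part files Pig stored (adjust the S3/HDFS path to yours).
hadoop fs -cat s3://my-bucket/pig-output/part-* > summary.json

# --data-binary keeps the newlines in the payload intact.
curl --silent --show-error \
  --header "Content-Type: application/json" \
  --data-binary @summary.json \
  "http://example.com/ingest"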

宁愿没拥抱 2024-11-24 07:11:53

As it turns out, Pig does correctly send EOF to external commands, so you do have the option of streaming everything through an external script. If it isn't working, then you probably have a hard-to-debug configuration problem.

Here's how to get started. Define an external command as follows, using whatever interpreter and script you need:

DEFINE UPLOAD_RESULTS `env GEM_PATH=/usr/lib/ruby/gems/1.9.0 ruby1.9 /home/hadoop/upload_results.rb`;

Stream the results through your script:

/* Write our results to our Ruby script for uploading.  We add
   a trailing bogus DUMP to make sure something actually gets run. */
empty = STREAM results THROUGH UPLOAD_RESULTS;
DUMP empty;

From Ruby, you can batch the input records into blocks of 1024:

STDIN.each_line.each_slice(1024) do |chunk|
  # 'chunk' is an array of 1024 lines, each consisting of tab-separated
  # fields followed by a newline. 
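  # (Hypothetical next step: build one JSON payload from 'chunk' and POST it,
  #  e.g. with Ruby's Net::HTTP, so each request carries up to 1024 records.)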
end

If this fails to work, check the following carefully:

  1. Does your script work from the command line?
  2. When run from Pig, does your script have all the necessary environment variables?
  3. Are your EC2 bootstrap actions working correctly?

Some of these are hard to verify, but if any of them are failing, you can easily waste quite a lot of time debugging.
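
For the first checklist item (and the interpreter/GEM_PATH part of the second), one quick check is to run the exact command from the DEFINE line by hand against a fake row; the tab-separated sample values below are made up:

$ printf 'some_key\t42\t2011-05-01\n' | \
    env GEM_PATH=/usr/lib/ruby/gems/1.9.0 ruby1.9 /home/hadoop/upload_results.rb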

Note, however, that you should strongly consider the alternative approaches recommended by mat kelcey.
