Streaming in Pig through PHP
I have a Pig script--currently running in local mode--that processes a huge file containing a list of categories:
/root/level1/level2/level3
/root/level1/level2/level3/level4
...
I need to insert each of these into an existing database by calling a stored procedure. Because I'm new to Pig and the UDF interface is a little daunting, I'm trying to get something done by streaming the file's content through a PHP script.
However, the PHP script only sees half of the category lines I pass through it. More precisely, I get back ceil( pig_categories/2 ) records: a limit of 15 produces 8 entries after streaming through the PHP script, and the last one is empty.
-- Pig script snippet
ordered = ORDER mappable_categories BY category;
limited = LIMIT ordered 20;
categories = FOREACH limited GENERATE category;
DUMP categories; -- Displays all 20 categories
streamed = STREAM limited THROUGH `php -nF categorize.php`;
DUMP streamed; -- Displays 10 categories
# categorize.php
$category = fgets( STDIN );
echo $category;
Any thoughts on what I'm missing? I've pored over the Pig reference manual for a while now, and there doesn't seem to be much information about streaming through a PHP script. I've also tried the #hadoop channel on IRC, to no avail. Any guidance would be much appreciated.
Thanks.
UPDATE
It's becoming evident that this is EOL-related. If I change the PHP script from using fgets() to stream_get_line(), then I get 10 items back, but the record that should be first is skipped and a trailing empty record is displayed.
(Arts/Animation)
(Arts/Animation/Anime)
(Arts/Animation/Anime/Characters)
(Arts/Animation/Anime/Clubs_and_Organizations)
(Arts/Animation/Anime/Collectibles)
(Arts/Animation/Anime/Collectibles/Cels)
(Arts/Animation/Anime/Collectibles/Models_and_Figures)
(Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures)
(Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam)
()
In that result set, there should be a first item of (Arts). Closing in, but there's still a gap to close.
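The counts reported above (20 in, 10 out; 15 in, 8 out with an empty last record) are what you would see if every output record consumed two input lines. That could happen if `php -F`, which runs the script once per input line and exposes that line as `$argn`, took one line itself before `fgets( STDIN )` took the next. The thread does not confirm this as the cause, so the simulation below only illustrates the arithmetic. `simulatePairedReads` is a hypothetical helper, not part of the original script.

```php
<?php
// Hypothetical helper: models a consumer that takes two input lines per
// output record and echoes only the second one.
function simulatePairedReads(array $lines): array
{
    $out = [];
    $i = 0;
    $n = count($lines);
    while ($i < $n) {
        $i++;                       // first line of the pair is consumed, not echoed
        $out[] = $lines[$i] ?? '';  // second line is echoed; '' once input runs out
        $i++;
    }
    return $out;
}

$fifteen = simulatePairedReads(range(1, 15));
echo count($fifteen), "\n";                            // 8
var_dump(end($fifteen) === '');                        // bool(true)
echo count(simulatePairedReads(range(1, 20))), "\n";   // 10
```

An odd input count leaves the last pair incomplete, which is where the trailing empty record comes from in this model.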
Comments (1)
So it turns out that this is one of those instances where whitespace matters. I had an empty line in front of my opening <?php tag. Once I tightened all of that up, everything sailed through and produced the expected output. /punitive headslap/
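For reference, here is a minimal sketch of what a tightened-up script might look like. It assumes the script is launched without `-F` (for example `php -n categorize.php`), so that it reads all of standard input itself. The `<?php` tag is the very first thing in the file and there is no closing `?>` tag, so nothing outside the loop reaches Pig's output. `relayCategories` is a hypothetical name, and the stored-procedure call is left as a comment because the thread does not show it.

```php
<?php
// categorize.php: the opening tag above must be the first thing in the file;
// any blank line before it is emitted as literal output and becomes a record.

// Hypothetical helper: copies each non-empty input line to the output stream
// and returns how many records were written.
function relayCategories($in, $out): int
{
    $count = 0;
    while (($line = fgets($in)) !== false) {
        $category = rtrim($line, "\r\n");   // strip the EOL, keep other characters
        if ($category === '') {
            continue;                        // skip blank lines
        }
        // The stored-procedure call for $category would go here (not shown in the thread).
        fwrite($out, $category . "\n");
        $count++;
    }
    return $count;
}

relayCategories(STDIN, STDOUT);
```

Under the same assumption, the Pig side would become STREAM limited THROUGH `php -n categorize.php`; with the rest of the snippet unchanged.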