Streaming through a PHP script in Pig

Posted 2024-09-25 18:06:34

I have a Pig script (currently running in local mode) that processes a huge file containing a list of categories:

/root/level1/level2/level3
/root/level1/level2/level3/level4
...

I need to insert each of these into an existing database by calling a stored procedure. Because I'm new to Pig and the UDF interface is a little daunting, I'm trying to get something done by streaming the file's content through a PHP script.

However, the PHP script only seems to see half of the category lines I pass through it. More precisely, I get back ceil( pig_categories/2 ) records: a limit of 15 produces 8 entries after streaming through the PHP script, and the last one is empty.

-- Pig script snippet
ordered  = ORDER mappable_categories BY category;
limited  = LIMIT ordered 20;

categories = FOREACH limited GENERATE category;
DUMP categories; -- Displays all 20 categories

streamed = STREAM limited THROUGH `php -nF categorize.php`;
DUMP streamed; -- Displays 10 categories

# categorize.php
$category = fgets( STDIN );
echo $category;

Any thoughts on what I'm missing? I've pored over the Pig reference manual for a while now and there doesn't seem to be much information related to streaming through a PHP script. I've also tried the #hadoop channel on IRC to no avail. Any guidance would be much appreciated.

Thanks.

UPDATE

It's becoming evident that this is EOL-related. If I change the PHP script from using fgets() to stream_get_line(), then I get 10 items back, but the record that should be first is skipped and there's a trailing empty record that gets displayed.

(Arts/Animation)
(Arts/Animation/Anime)
(Arts/Animation/Anime/Characters)
(Arts/Animation/Anime/Clubs_and_Organizations)
(Arts/Animation/Anime/Collectibles)
(Arts/Animation/Anime/Collectibles/Cels)
(Arts/Animation/Anime/Collectibles/Models_and_Figures)
(Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures)
(Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam)
()

In that result set, there should be a first item of (Arts). Closing in, but there's still some gap to close.
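
For anyone trying to reproduce the fgets() numbers above (10 of 20, or 8 of 15 with an empty last record), here is one reading that fits them. It is a hypothesis, not something confirmed in this thread, and it does not account for the stream_get_line() output. PHP's CLI documents -F as executing the file once for every input line, with the current line placed in $argn. If the runner takes a line for itself and the script then calls fgets( STDIN ) on the same input, the script would only ever see every second line. The Python sketch below models just that arithmetic; the function name and the sample data are made up for illustration, and it does not run PHP or Pig.

```python
from typing import Iterator, List


def simulate_per_line_runner(lines: List[str]) -> List[str]:
    """Model a runner that consumes one line, then runs a script that reads one more.

    Assumption: runner and script share the same input stream.
    """
    stream: Iterator[str] = iter(lines)
    emitted: List[str] = []
    for _argn in stream:              # the runner takes a line for itself
        follow_up = next(stream, "")  # the script's own read; "" models a failed read at EOF
        emitted.append(follow_up)     # the script echoes whatever it read
    return emitted


if __name__ == "__main__":
    for n in (20, 15):
        out = simulate_per_line_runner([f"cat{i}" for i in range(1, n + 1)])
        # Record count, and whether the last record is empty
        print(n, len(out), out[-1] == "")
```

Under this model, 20 input lines give 10 records and 15 input lines give 8 records with an empty last one, which is ceil( n/2 ) in both cases. If the hypothesis holds, reading $argn inside the script instead of calling fgets() would be the thing to try.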

Comments (1)

锦欢 2024-10-02 18:06:42

So it turns out that this is one of those instances where whitespace matters. I had an empty line in front of my opening <?php tag. Once I tightened all of that up, everything sailed through and produced as expected. /punitive headslap/
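
The reason this matters is that PHP sends anything outside its tags straight to output, so a blank line ahead of <?php becomes part of what the script writes back to Pig. A quick way to check a script for this is sketched below in Python. The helper name and the sample script text are made up for illustration, and it deliberately ignores shebang lines and short open tags.

```python
def text_before_open_tag(source: str) -> str:
    """Return whatever precedes the first '<?php' tag; PHP would emit it verbatim."""
    tag_at = source.find("<?php")
    # With no tag at all, the whole file is literal output.
    return source if tag_at == -1 else source[:tag_at]


if __name__ == "__main__":
    # Hypothetical script contents: one with a stray blank line, one tightened up.
    broken = "\n<?php\n$category = fgets(STDIN);\necho $category;\n"
    fixed = broken.lstrip()
    print(repr(text_before_open_tag(broken)))
    print(repr(text_before_open_tag(fixed)))
```

A non-empty result means the script will emit extra bytes before any of its own output.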
