How do I generate a custom schema from a relation in Pig?

Posted on 2024-11-01 06:59:13

I have a schema describing tf-idf values for words in various articles.
Its description looks like:

tfidf_relation: {word: chararray,id: bytearray,tfidf: double}

Here is an example of such data:

(cat,article_one,0.13515503603605478)
(cat,article_two,0.4054651081081644)
(dog,article_one,0.3662040962227032)
(apple,article_three,0.3662040962227032)
(orange,article_three,0.3662040962227032)
(parrot,article_one,0.13515503603605478)
(parrot,article_three,0.13515503603605478)

I want to get output in the form:
cat article_one 0.13515503603605478, article_two 0.4054651081081644
and so on.
The question is: how do I build a relation from this that contains the word field and a tuple of the id and tfidf fields?
Something like this:

X = FOREACH tfidf_relation GENERATE word, (id, tfidf);

doesn't work. What is the correct syntax for this?


Comments (2)

ら栖息 2024-11-08 06:59:13

Try this:

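    -- Load the comma-separated tf-idf rows with an explicit schema.
    t = LOAD 'input/file' USING PigStorage(',') AS (word: chararray, id: bytearray, tfidf: double);
    -- Collect all rows that share the same word into one bag.
    u = GROUP t BY word;
    DUMP u;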

The output will be

    (cat,{(cat,article_two,0.4054651081081644),(cat,article_one,0.13515503603605478)})
    (dog,{(dog,article_one,0.3662040962227032)})
    (apple,{(apple,article_three,0.3662040962227032)})
    (orange,{(orange,article_three,0.3662040962227032)})
    (parrot,{(parrot,article_three,0.13515503603605478),(parrot,article_one,0.13515503603605478)})

I hope this is what you are looking for.
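
The grouped bags above still repeat the word inside every tuple. If only the id and tfidf pairs are wanted next to each word, one possible follow-up is to project those two fields out of the bag. This is a sketch and not part of the original answer; the aliases `x` and `pairs` are names made up for illustration, and the input path is the same placeholder as above.

```pig
-- Same load and grouping as in the answer above.
t = LOAD 'input/file' USING PigStorage(',') AS (word: chararray, id: bytearray, tfidf: double);
u = GROUP t BY word;

-- 'group' holds the grouping key (the word);
-- t.(id, tfidf) keeps only those two fields from each tuple in the bag.
x = FOREACH u GENERATE group AS word, t.(id, tfidf) AS pairs;
DUMP x;
```

Each output row should then hold the word followed by a bag of (id, tfidf) tuples. The order of tuples inside a bag is not guaranteed.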

独闯女儿国 2024-11-08 06:59:13
X = FOREACH tfidf_relation GENERATE word, {(id, tfidf)};

This is probably what you need.
