Using correlation (COR) in Hadoop Pig

Posted on 2024-12-20 09:13:33

I have a list of vectors, and I want to compute the correlation of each one against an input vector of numbers. How should I store the list of vectors, and how do I pass the input vector to Pig's COR() function?

-- SET command?  what is it used for? this doesn't work
SET input_nums {0,2,0,1,2,0,0,0,0} AS bag{}

-- storing vectors in this format doesn't seem to work 
-- import via: data = LOAD mynums AS (id:long, nums:bag{});
1\t{1,3,3,4,5}
2\t{3,4,5,6,6}

-- this seems to work, but adds overhead on storage
-- import via: data = LOAD mynums AS (id:long, nums:bag{t:(x:long)});
1\t{(1),(3),(3),(4),(5)}
2\t{(3),(4),(5),(6),(6)}

-- assuming "data" and "input_nums" are set, no idea how to use though:
results = COR(data, input_nums) -- nope
results = FOREACH data GENERATE id, COR(nums, input_nums) -- nope

Less important side question: I've seen Pig scripts that take arguments. Can I pass in my input_nums via these arguments (i.e. as a string argument that Pig then turns into a bag)?

Comments (1)

如梦初醒的夏天 2024-12-27 09:13:33

The only requirement for running COR in Pig is that the input arguments are bags of doubles. Also, make sure your Pig version is >= 0.9.1 (refer to JIRA: PIG-2286).

Input data:
1<tab>10
2<tab>12
3<tab>13
4<tab>14

Script:
data = LOAD 'cor.txt' AS (series1:double, series2:double);
rel = GROUP data ALL;
corop = FOREACH rel GENERATE COR(data.series1, data.series2);
dump corop;

Output:
({(var0,var1,0.9827076298239908)})
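
As a sanity check on the number above, the same two columns can be run through a plain Pearson correlation outside Pig. The sketch below is not part of the original answer: the function name `pearson` is hypothetical, and it assumes that COR reports the Pearson coefficient, which is consistent with the output shown for this data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length number sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# The two columns of cor.txt from the example above.
series1 = [1.0, 2.0, 3.0, 4.0]
series2 = [10.0, 12.0, 13.0, 14.0]

r = pearson(series1, series2)
print(r)  # roughly 0.9827, in line with the value Pig reports
```

Worked by hand, the deviation products sum to 6.5 and the squared deviations to 5 and 8.75, so r = 6.5 / sqrt(43.75), which agrees with the Pig output to the digits shown.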
