Is it possible to cross join a row in a relation with a tuple in that row in Pig?

Posted on 2024-12-04 04:32:20

I have a set of data that shows users, collections of fruit they like, and home city:

Alice\tApple:Orange\tSacramento
Bob\tApple\tSan Diego
Charlie\tApple:Pineapple\tSacramento

I would like to create a Pig query that counts the users who enjoy each type of fruit in each city, where the results from the query for the data above would look like this:

Apple\tSacramento\t2
Apple\tSan Diego\t1
Orange\tSacramento\t1
Pineapple\tSacramento\t1

The part I can't figure out is how to cross join the split fruit rows with the rest of the data from the same row, so:

Alice\tApple:Orange\tSacramento

becomes:

Alice\tApple\tSacramento 
Alice\tOrange\tSacramento

I know I can use TOKENIZE to split the string 'Apple:Orange' into the tuple ('Apple', 'Orange'), but I don't know how to get the cross product of that tuple with the rest of the row ('Alice').

One brute-force solution I came up with is to use streaming to run the input collection through an external program and handle the "cross join" there, producing multiple output rows per input row.

This seems like it should be unnecessary though. Are there better ideas?

昔梦 2024-12-11 04:32:20

You should use FLATTEN, which works great with TOKENIZE to do stuff like this.

b = FOREACH a GENERATE name, FLATTEN(TOKENIZE(fruits)) as fruit, city;

FLATTEN takes a bag and "flattens" it out across different rows. TOKENIZE breaks your fruits out into a bag (not a tuple like you said), and then FLATTEN produces the cross-like behavior you are looking for. I point out that it is a bag and not a tuple because FLATTEN is overloaded and behaves differently with tuples: flattening a bag produces one row per element, while flattening a tuple produces extra columns in the same row.
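For completeness, here is a minimal sketch of how the whole query from the question might look, from load to per-city counts. The file path `users.tsv` and the aliases `c`, `d`, and `e` are made up for illustration. One caveat: as far as I know, TOKENIZE's default delimiters are whitespace, double quote, comma, parentheses, and star, so a colon-separated field probably needs the optional delimiter argument that newer Pig releases document. Check that your version supports it.

```pig
-- Hypothetical input path; fields are tab-separated as in the question.
a = LOAD 'users.tsv' USING PigStorage('\t')
    AS (name:chararray, fruits:chararray, city:chararray);

-- TOKENIZE returns a bag of fruits; FLATTEN turns each bag element into its own row,
-- repeating name and city alongside it.
-- The ':' delimiter argument is an assumption about your Pig version.
b = FOREACH a GENERATE name, FLATTEN(TOKENIZE(fruits, ':')) AS fruit, city;

-- Group by the (fruit, city) pair and count the users in each group.
c = GROUP b BY (fruit, city);
d = FOREACH c GENERATE FLATTEN(group) AS (fruit, city), COUNT(b) AS users;

-- Sort only to make the output easier to compare with the question.
e = ORDER d BY fruit, city;
DUMP e;
```

Note that `FLATTEN(group)` here is the tuple form of FLATTEN: the group key is a tuple, so it unpacks into the two columns `fruit` and `city` rather than into extra rows. For the sample data this is intended to yield the four fruit/city counts listed in the question.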

I first learned of the FLATTEN/TOKENIZE technique in the canonical word count example, which tokenizes each line of text and then flattens the resulting words out into rows.
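For reference, the word count pattern looks roughly like the sketch below. The input path `input.txt` and all alias names are placeholders, not taken from any particular tutorial.

```pig
-- Hypothetical input: one line of text per record.
lines = LOAD 'input.txt' AS (line:chararray);

-- Split each line into a bag of words, then flatten the bag into one row per word.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Count occurrences of each distinct word.
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```

The fruit query has the same shape. The only differences are that other columns are carried along next to the flattened one and that the grouping key has two fields instead of one.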
