Is it possible to cross-join a row in a relation with a tuple in that row in Pig?
I have a set of data that shows users, collections of fruit they like, and home city:
Alice\tApple:Orange\tSacramento
Bob\tApple\tSan Diego
Charlie\tApple:Pineapple\tSacramento
I would like to create a Pig query that counts the number of users who enjoy each type of fruit in each city, where the results from the query for the data above would look like this:
Apple\tSacramento\t2
Apple\tSan Diego\t1
Orange\tSacramento\t1
Pineapple\tSacramento\t1
The part I can't figure out is how to cross join the split fruit rows with the rest of the data from the same row, so:
Alice\tApple:Orange\tSacramento
becomes:
Alice\tApple\tSacramento
Alice\tOrange\tSacramento
I know I can use TOKENIZE to split the string 'Apple:Orange' into the tuple ('Apple', 'Orange'), but I don't know how to get the cross product of that tuple with the rest of the row ('Alice').
One brute-force solution I came up with is to use streaming to run the input collection through an external program, and handle the "cross join" there by producing multiple output rows per input row.
This seems like it should be unnecessary though. Are there better ideas?
Comments (1)
You should use FLATTEN, which works well with TOKENIZE for this kind of thing. FLATTEN takes a bag and "flattens" it out across different rows. TOKENIZE breaks your fruits out into a bag (not a tuple, as you said), and then FLATTEN produces the cross-like behavior you are looking for. I point out that it is a bag and not a tuple because FLATTEN is overloaded and behaves differently with tuples.

I first learned of the FLATTEN/TOKENIZE technique in the canonical word count example, which tokenizes each line and then flattens the words out into rows.
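For the fruit data in the question, the approach could be sketched roughly as follows. This is a minimal, untested sketch rather than a definitive script: the input path 'users.tsv' and all alias and field names are hypothetical, and it assumes a Pig version whose TOKENIZE accepts an explicit delimiter argument (by default TOKENIZE does not split on ':').

```pig
-- Hypothetical input path and field names; the file is assumed to be tab-separated.
users = LOAD 'users.tsv' USING PigStorage('\t')
        AS (name:chararray, fruits:chararray, city:chararray);

-- TOKENIZE returns a bag with one single-field tuple per fruit.
-- FLATTEN on that bag emits one row per fruit, repeating name and city.
exploded = FOREACH users GENERATE
               name,
               FLATTEN(TOKENIZE(fruits, ':')) AS fruit,
               city;

-- Group the exploded rows by (fruit, city) and count the users in each group.
by_fruit_city = GROUP exploded BY (fruit, city);

counts = FOREACH by_fruit_city GENERATE
             FLATTEN(group) AS (fruit, city),
             COUNT(exploded) AS num_users;

DUMP counts;
```

Note that the second FLATTEN is applied to the group key, which is a tuple, so it expands into columns rather than rows. This is the overloaded behavior mentioned above. The order of the rows printed by DUMP is not guaranteed unless an ORDER step is added.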