Hadoop Pig: ordering results and finding an item's position?
I want to sort my Pig results and then determine where certain items fall in the ordered results. Example:
mydata = LOAD 'mydata.txt' AS (label:chararray, rank_score:float);
ranked_data = ORDER mydata BY rank_score DESC;
ranked_positions = FOREACH ranked_data GENERATE label, AUTO_INCREMENT_ID;
results = FILTER ranked_positions BY label == 'item1' OR label == 'item2';
DUMP results;
AUTO_INCREMENT_ID would auto-increment in my perfect world. Given how mappers/reducers are independent from each other, I'm guessing Pig/Hadoop may not support this. If not, can you think of another way to generate my end result?
Example input:
item1 34.33
item2 48.39
item3 93.3
Desired output:
item1 3
item2 2
Comments (1)
If you set the parallelism of ORDER to 1 (PARALLEL 1), you can do the auto-increment yourself in a UDF. Of course, that has the potentially undesired effect of using only one reducer for the sort.
(Also, I am not sure how you got your example output: the input seems to be already ordered, so item1 should have id 1 and item2 should have id 2, right? Did you mean to order by rank_score DESC?)
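To make the single-reducer idea concrete, here is a minimal plain-Python sketch of the logic such a UDF would carry out: one sorted stream of rows, a running counter, then the filter. It is only a model of the computation, not Pig API code; the names make_auto_increment and rank_positions are made up for illustration, and wiring a counter into an actual Pig UDF is not shown.

```python
from itertools import count


def make_auto_increment(start=1):
    """Return a function that yields start, start+1, ... on successive calls.

    Hypothetical stand-in for the stateful counter a UDF would keep.
    It is only meaningful when every row passes through the same
    instance, which is why the sort must run on a single reducer.
    """
    counter = count(start)
    return lambda: next(counter)


def rank_positions(rows, wanted):
    """rows: iterable of (label, rank_score) pairs.

    Returns [(label, position)] for the labels in `wanted`,
    in descending rank_score order.
    """
    # Counterpart of: ORDER mydata BY rank_score DESC (one sorted stream).
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    next_id = make_auto_increment()
    # Counterpart of the FOREACH: attach the running id to each row in order.
    positioned = [(label, next_id()) for label, _score in ordered]
    # Counterpart of the FILTER: keep only the requested labels.
    return [(label, pos) for label, pos in positioned if label in wanted]


rows = [("item1", 34.33), ("item2", 48.39), ("item3", 93.3)]
print(rank_positions(rows, {"item1", "item2"}))
```

With the sample input from the question and a descending sort, item2 lands at position 2 and item1 at position 3, which matches the desired output.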