LEAD and LAG in HBase
I'm trying to figure out how to do the equivalent of Oracle's LEAD and LAG in HBase, or find some other pattern that solves my problem. I could quite easily write a MapReduce program that does this, but I'd love to exploit the fact that the data is already sorted the way I need it.
My problem is as follows: I have a rowkey and a value that look like:
(employee name + timestamp) => data:salary
So, some example data might be:
miller, bob;2010-01-14 => data:salary=90000
miller, bob;2010-11-04 => data:salary=102000
miller, bob;2011-12-03 => data:salary=107000
monty, fred;2010-04-10 => data:salary=19000
monty, fred;2011-09-09 => data:salary=24000
What I want to do is calculate the changes of salary, record by record. I want to transform the above data into differences between records:
miller, bob;2010-01-14 => data:salarydiff=90000
miller, bob;2010-11-04 => data:salarydiff=12000
miller, bob;2011-12-03 => data:salarydiff=5000
monty, fred;2010-04-10 => data:salarydiff=19000
monty, fred;2011-09-09 => data:salarydiff=5000
I'm up for changing the rowkey strategy if necessary.
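To make the target concrete, the transformation above can be sketched as a single ordered pass that remembers the previous row, which is what LAG does. This is only an in-memory illustration in plain Java, not HBase code: the `SalaryLagSketch` class and its `diffs` method are hypothetical names, and a `TreeMap` stands in for the sorted table.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class SalaryLagSketch {
    /** Returns rowKey -> salary difference for rows sorted by "employee;date". */
    static Map<String, Long> diffs(SortedMap<String, Long> rows) {
        Map<String, Long> out = new LinkedHashMap<>();
        String prevEmployee = null;
        long prevSalary = 0L;
        for (Map.Entry<String, Long> row : rows.entrySet()) {
            String key = row.getKey();
            String employee = key.substring(0, key.lastIndexOf(';'));
            // LAG: use the previous row only if it belongs to the same employee;
            // otherwise the first record's diff is the salary itself.
            long lag = employee.equals(prevEmployee) ? prevSalary : 0L;
            out.put(key, row.getValue() - lag);
            prevEmployee = employee;
            prevSalary = row.getValue();
        }
        return out;
    }

    public static void main(String[] args) {
        // TreeMap keeps keys sorted, imitating the ordered rowkeys in the example data.
        SortedMap<String, Long> rows = new TreeMap<>();
        rows.put("miller, bob;2010-01-14", 90000L);
        rows.put("miller, bob;2010-11-04", 102000L);
        rows.put("miller, bob;2011-12-03", 107000L);
        rows.put("monty, fred;2010-04-10", 19000L);
        rows.put("monty, fred;2011-09-09", 24000L);
        diffs(rows).forEach((k, v) -> System.out.println(k + " => data:salarydiff=" + v));
    }
}
```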
Comments (1)
What I'd do is change the key so that the timestamp sorts in descending order (newest salary first).
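One possible way to get a descending timestamp in the key is to store an inverted date. The sketch below is an assumption, not something prescribed here: the `DescendingKeySketch` class, the `rowKey` helper, and the "99999999 minus yyyymmdd" encoding are all hypothetical, and a `TreeSet` of ASCII strings only imitates HBase's byte-wise key ordering.

```java
import java.time.LocalDate;
import java.util.Locale;
import java.util.TreeSet;

class DescendingKeySketch {
    /** Builds "employee;inverted-date" so that newer dates sort first (years 0 to 9999). */
    static String rowKey(String employee, LocalDate date) {
        int yyyymmdd = date.getYear() * 10000 + date.getMonthValue() * 100 + date.getDayOfMonth();
        // Zero-pad to a fixed width so string order matches numeric order.
        return String.format(Locale.ROOT, "%s;%08d", employee, 99999999 - yyyymmdd);
    }

    public static void main(String[] args) {
        // TreeSet sorts the keys the way a sorted table would for ASCII keys.
        TreeSet<String> keys = new TreeSet<>();
        keys.add(rowKey("miller, bob", LocalDate.of(2010, 1, 14)));
        keys.add(rowKey("miller, bob", LocalDate.of(2010, 11, 4)));
        keys.add(rowKey("miller, bob", LocalDate.of(2011, 12, 3)));
        // Prints the 2011-12-03 key first and the 2010-01-14 key last.
        keys.forEach(System.out::println);
    }
}
```

The original date is still recoverable by subtracting the suffix from 99999999 again.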
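Now you can run a simple map job that scans the table. In the map, create a new Scan that starts at the current key, skip the current row itself, and call next() on the scanner to get the previous salary (with descending timestamps, the next row in key order is the older record). Calculate the diff and store it in a new column on the current row key. If the next row belongs to a different employee, the current row is that employee's first record.
Basically, in your mapper class (the one that extends TableMapper), override the setup method and get the configuration, which you need in order to open the table for these extra lookups.
Then, inside map, extract the row key from the row parameter, create the new Scan, and continue as explained above.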
In most cases the next record will be in the same region; occasionally the lookup will go to another region server.
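The per-row logic described above can be sketched without HBase as follows. Everything here is a stand-in under stated assumptions: a `NavigableMap` simulates the sorted table, `higherEntry` plays the role of "scan from the current key and take the next row", the keys use a hypothetical inverted-date suffix (99999999 minus yyyymmdd) so that newer records come first, and `PreviousSalaryLookupSketch` and `salaryDiff` are made-up names. A real mapper would issue the lookup against the table instead.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class PreviousSalaryLookupSketch {
    /** Diff between the salary at currentKey and the same employee's previous (older) salary. */
    static long salaryDiff(NavigableMap<String, Long> table, String currentKey) {
        long current = table.get(currentKey); // assumes currentKey exists in the table
        String employeePrefix = currentKey.substring(0, currentKey.lastIndexOf(';') + 1);
        // Simulates a scan that starts just after the current key and reads one row.
        Map.Entry<String, Long> next = table.higherEntry(currentKey);
        // Guard against running into the next employee's rows or the end of the table.
        boolean sameEmployee = next != null && next.getKey().startsWith(employeePrefix);
        return sameEmployee ? current - next.getValue() : current;
    }

    public static void main(String[] args) {
        NavigableMap<String, Long> table = new TreeMap<>();
        table.put("miller, bob;79888796", 107000L); // 2011-12-03
        table.put("miller, bob;79898895", 102000L); // 2010-11-04
        table.put("miller, bob;79899885", 90000L);  // 2010-01-14
        table.put("monty, fred;79889090", 24000L);  // 2011-09-09
        table.put("monty, fred;79899589", 19000L);  // 2010-04-10
        for (String key : table.keySet()) {
            System.out.println(key + " => data:salarydiff=" + salaryDiff(table, key));
        }
    }
}
```

The prefix check matters because the row after an employee's oldest record is the next employee's newest one, so without it the first salary would be diffed against the wrong person.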