LEAD and LAG in HBase

Posted 2024-12-20 19:14:02

I'm trying to figure out how to do the equivalent of Oracle's LEAD and LAG in HBase, or find some other pattern that will solve my problem. I could write a MapReduce program that does this quite easily, but I'd love to exploit the fact that the data is already sorted the way I need it.

My problem is as follows: I have a row key and a value that look like this:

(employee name + timestamp) => data:salary

So, some example data might be:

miller, bob;2010-01-14 => data:salary=90000
miller, bob;2010-11-04 => data:salary=102000
miller, bob;2011-12-03 => data:salary=107000
monty, fred;2010-04-10 => data:salary=19000
monty, fred;2011-09-09 => data:salary=24000

What I want to do is calculate the change in salary, record by record. I want to transform the data above into the differences between consecutive records for each employee (an employee's first record keeps the full salary):

miller, bob;2010-01-14 => data:salarydiff=90000
miller, bob;2010-11-04 => data:salarydiff=12000
miller, bob;2011-12-03 => data:salarydiff=5000
monty, fred;2010-04-10 => data:salarydiff=19000
monty, fred;2011-09-09 => data:salarydiff=5000

I'm open to changing the row key strategy if necessary.
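To make the target transformation concrete, here is a minimal sketch in plain Java. It is not HBase code: a `TreeMap` stands in for the table (an assumption for illustration), because it iterates keys in sorted order the way a scan would. The class and method names are hypothetical. It walks the rows once, remembers the previous row, and resets at each new employee, which is the LAG behaviour described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class SalaryLag {
    // Computes salary differences in a single pass over rows sorted by key.
    // Row keys are assumed to look like "<employee>;<yyyy-MM-dd>".
    static Map<String, Long> salaryDiffs(SortedMap<String, Long> rows) {
        Map<String, Long> diffs = new LinkedHashMap<>();
        String prevEmployee = null;
        long prevSalary = 0L;
        for (Map.Entry<String, Long> row : rows.entrySet()) {
            String key = row.getKey();
            String employee = key.substring(0, key.lastIndexOf(';'));
            // The lagged value is 0 for the first record of each employee.
            long lag = employee.equals(prevEmployee) ? prevSalary : 0L;
            diffs.put(key, row.getValue() - lag);
            prevEmployee = employee;
            prevSalary = row.getValue();
        }
        return diffs;
    }

    public static void main(String[] args) {
        SortedMap<String, Long> rows = new TreeMap<>();
        rows.put("miller, bob;2010-01-14", 90000L);
        rows.put("miller, bob;2010-11-04", 102000L);
        rows.put("miller, bob;2011-12-03", 107000L);
        rows.put("monty, fred;2010-04-10", 19000L);
        rows.put("monty, fred;2011-09-09", 24000L);
        salaryDiffs(rows).forEach(
            (key, diff) -> System.out.println(key + " => data:salarydiff=" + diff));
    }
}
```

The same one-pass idea only works when a single reader sees an employee's rows in order, which is why the sort order of the row key matters here.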

Answer by 握住我的手, 2024-12-27 19:14:02

What I'd do is change the key so that the timestamps sort in descending order (newest salary first):

miller, bob;2011-12-03 => data:salary=107000
miller, bob;2010-11-04 => data:salary=102000
miller, bob;2010-01-14 => data:salary=90000
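One detail worth spelling out: HBase orders rows by the bytes of the row key, so plain date strings sort oldest first. A common way to get the newest-first order shown above is to store an inverted form of the timestamp in the key. The sketch below is one possible encoding, not something the answer prescribes; the class name, the `rowKey` and `dateOf` helpers, and the 8-digit `99999999 - yyyymmdd` layout are all assumptions for illustration.

```java
import java.time.LocalDate;
import java.util.TreeSet;

public class ReverseDateKey {
    private static final int MAX = 99_999_999;

    // Builds "<employee>;<inverted date>" so that later dates sort first.
    static String rowKey(String employee, LocalDate date) {
        int yyyymmdd = date.getYear() * 10_000
                + date.getMonthValue() * 100
                + date.getDayOfMonth();
        // Zero-padding keeps every key the same width, so string order
        // matches numeric order.
        return String.format("%s;%08d", employee, MAX - yyyymmdd);
    }

    // Recovers the original date from a key produced by rowKey.
    static LocalDate dateOf(String rowKey) {
        int inverted = Integer.parseInt(rowKey.substring(rowKey.lastIndexOf(';') + 1));
        int yyyymmdd = MAX - inverted;
        return LocalDate.of(yyyymmdd / 10_000, yyyymmdd / 100 % 100, yyyymmdd % 100);
    }

    public static void main(String[] args) {
        // A TreeSet sorts strings the way HBase would sort these ASCII keys.
        TreeSet<String> keys = new TreeSet<>();
        keys.add(rowKey("miller, bob", LocalDate.of(2010, 1, 14)));
        keys.add(rowKey("miller, bob", LocalDate.of(2010, 11, 4)));
        keys.add(rowKey("miller, bob", LocalDate.of(2011, 12, 3)));
        for (String key : keys) {
            System.out.println(key + " (" + dateOf(key) + ")");
        }
    }
}
```

With keys built this way, iterating in key order visits 2011-12-03 first and 2010-01-14 last for the same employee.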

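Now you can run a simple map job that scans the table. In the map, create a new Scan that starts at the current row key and call next() on its scanner to reach the following row, which holds the previous salary. The start row of a Scan is inclusive, so the first result is the current row itself and the row after it is the one you want. Check that it still belongs to the same employee, then calculate the difference and store it in a new column on the current row key.

In your mapper class (the one that extends TableMapper), override the setup method and get the configuration: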

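// Field on the mapper class, reused by every map() call.
private HTable table;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration config = context.getConfiguration();
    // Replace <Table Name> with the name of your table.
    table = new HTable(config, <Table Name>);
}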

Then, inside the map method, extract the row key from the row parameter, create the new Scan, and continue as explained above.

In most cases the next record will be in the same region; occasionally the lookup will go to another region server.
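The lookup step can be sketched without a cluster. In the plain-Java example below, a `NavigableMap` stands in for the table and `higherEntry` plays the role of "scan from the current key and take the next row"; both are stand-ins, not HBase calls. The class name, the `diffFor` helper, and the inverted-date keys (assumed to be `99999999 - yyyymmdd`, so newer rows sort first) are hypothetical.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class NextRowDiff {
    // Returns the salary difference for one row, looking only at the row
    // that follows it in key order. Assumes rowKey exists in the table.
    static long diffFor(NavigableMap<String, Long> table, String rowKey) {
        // Everything up to and including ';' identifies the employee.
        String employeePrefix = rowKey.substring(0, rowKey.lastIndexOf(';') + 1);
        long current = table.get(rowKey);
        Map.Entry<String, Long> next = table.higherEntry(rowKey);
        // The next row may belong to a different employee (or not exist),
        // in which case there is no previous salary to subtract.
        boolean sameEmployee = next != null && next.getKey().startsWith(employeePrefix);
        return sameEmployee ? current - next.getValue() : current;
    }

    public static void main(String[] args) {
        NavigableMap<String, Long> table = new TreeMap<>();
        table.put("miller, bob;79888796", 107000L); // 2011-12-03
        table.put("miller, bob;79898895", 102000L); // 2010-11-04
        table.put("miller, bob;79899885", 90000L);  // 2010-01-14
        table.put("monty, fred;79889090", 24000L);  // 2011-09-09
        table.put("monty, fred;79899589", 19000L);  // 2010-04-10
        for (String key : table.keySet()) {
            System.out.println(key + " => data:salarydiff=" + diffFor(table, key));
        }
    }
}
```

The prefix check matters at employee boundaries: without it, bob's oldest record would be compared against fred's newest one.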
