HBase scan with comparison filters has a long delay when returning the last row
I have HBase running in standalone mode and encountered some problems when I query the tables using the Java API.
The table has several million entries (but might grow to billions) which have the following row key format:
<UUID>-<Tag>-<Timestamp>
I use two compare-operation filters to query a specific row range which represents a time interval.
Scan scan = new Scan();

RowFilter upperRowFilter = new RowFilter(CompareOp.LESS,
        new BinaryComparator((securityId + eventType + intervalEnd)
                .getBytes()));
RowFilter lowerRowFilter = new RowFilter(CompareOp.GREATER_OR_EQUAL,
        new BinaryComparator((securityId + eventType + intervalStart)
                .getBytes()));

FilterList filterList = new FilterList();
filterList.addFilter(lowerRowFilter);
filterList.addFilter(upperRowFilter);

scan.setFilter(filterList);

ResultScanner scanner = table.getScanner(scan);
Result result = scanner.next();
When I call the ResultScanner#next() method, everything works fine until it gets to the last row of the key range specified through the filters. It then takes up to 40 seconds until the ResultScanner returns that last row, even though it is lexicographically smaller than the upper row-range limit.
When I change the order of the filters in the filterList from
filterList.addFilter(lowerRowFilter);
filterList.addFilter(upperRowFilter);
to
filterList.addFilter(upperRowFilter);
filterList.addFilter(lowerRowFilter);
it takes the scanner up to 40 seconds until it starts to return any results, but there is no further delay on returning the last row, so I figured that the delay comes from the CompareOp.LESS filter.
The only way I know of to get around this delay is to omit the upperRowFilter and check manually whether the row keys are out of range, but I am sure something must be wrong here, because I found nothing about this problem when searching the internet.
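A rough sketch of that manual workaround (reusing lowerRowFilter, table, securityId, eventType and intervalEnd from the code above; Bytes is org.apache.hadoop.hbase.util.Bytes):

// Keep only the lower filter and stop manually once a row key reaches the upper bound.
byte[] upperBound = (securityId + eventType + intervalEnd).getBytes();
Scan scan = new Scan();
scan.setFilter(lowerRowFilter);
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
    if (Bytes.compareTo(result.getRow(), upperBound) >= 0) {
        break; // row key is past the interval end, no need to keep scanning
    }
    // process result
}
scanner.close();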
I also already tried to get rid of the delay with scanner caching. When I use a cache size smaller than the number of rows returned, it doesn't change anything; when I use a cache size bigger than the number of rows returned, the delay is still there, but it occurs before any results are returned instead.
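For reference, the scanner caching mentioned here is the per-RPC batch size set on the Scan object; a hedged example with an arbitrary value:

// Fetch up to 500 rows per RPC instead of one row per next() call.
scan.setCaching(500);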
Do you have any idea what could cause that kind of behaviour? Am I doing it wrong or is there something that I'm missing?
Thanks in advance!
1 Answer
The problem is that your scanner is scanning the entire table and throwing away the results that don't match your query. You need to explicitly set a stop row of (securityId + eventType + intervalEnd). If you set a corresponding start row of (securityId + eventType + intervalStart), then you won't need a filter at all and the scan will be efficient no matter the size of your data set.
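A minimal sketch of that suggestion, assuming securityId, eventType, intervalStart and intervalEnd are the same String variables used in the question and table is the same table instance:

// Bound the scan by row keys instead of filters; the scan then only reads
// rows in [startRow, stopRow) and never touches the rest of the table.
Scan scan = new Scan();
scan.setStartRow((securityId + eventType + intervalStart).getBytes()); // inclusive
scan.setStopRow((securityId + eventType + intervalEnd).getBytes());    // exclusive
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
    // process result
}
scanner.close();

Because a RowFilter only discards rows and does not tell the scan where to stop, the original scan keeps reading (and throwing away) everything after intervalEnd, which matches the delay described in the question.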