HBase |使用HBase shell CMD隐藏的HBase Col预选赛，但可通过HBaserdd Spark Code可见

发布于 2025-01-28 17:08:15 字数 1690 浏览 6 评论 0原文

我遇到了与HBase设计有关的非常奇怪的情况。

Hbase version >> Version 2.1.0-cdh6.2.1

因此，问题陈述是在HBASE中，我们的表中有一排。当我们从下游收到数据时，我们执行新的插入，然后对同一HBase行进行后续更新。

say we received data like below
INSERT of {a=1,b=1,c=1,d=1,rowkey='row1'}
UPDATE of {b=1,c=1,d=1,rowkey='row1'}

和假设最后一行是这样的，在我们的HBase表中

hbase(main):008:0> get 'test', 'row1'
COLUMN      CELL
cf:b        timestamp=1288380727188, value=value1
cf:c        timestamp=1288380727188, value=value1
cf:d        timestamp=1288380727188, value=value1
1 row(s) in 0.0400 seconds

， cf：a ，上述数据中缺少列限制符，如上图所示，当通过扫描或获取命令获取时。但是根据我们的摄入流/过程，它应该在那里。因此，我们正在为它去哪里或发生了什么等方面进行分类。仍在进行分析，我们对它的位置毫无意义。

现在，简短的故事，我们有一个火花将HBASE表读为RDD， hbasecontext.hbaserdd api函数，将其转换为数据帧并显示表格数据。因此，我们在同一张桌子上运行了此火花util，以帮助定位这一行，非常令人惊讶的是，它为同一rowkey'Row1'返回了2行，其中第一行与上面的get/scan Row（上述数据）和第二行我们缺少列 cf：a （令人惊讶的是，它具有与预期相同的值）。假设输出数据帧出现在下面。

rowkey |cf:a |cf:b|cf:c|cf:d
row1   |null | 1  | 1  | 1         >> cf:a col qualifier missing (same as in Hbase shell)
row1   | 1   | 1  | 1  | 1         >> This cf:a was expected

我们也检查了HBASE表模式，因此我们没有多个版本的 cf：A 在描述中，或者我们在表上不做版本化。 HBASE表的描述的模式

VERSIONS => '1'

无论如何都有所描述的，我对HBASERDD如何能够读取该行或丢失的COL限定符，但通过GET读取HBase shell CMD，扫描没有读取缺失的COL预选赛或行。任何HBASE专家或建议。

仅供参考，我也通过该行上的get版本也尝试了HBase shell CMDS，但它仅返回上述数据，而不是缺少 cf：a 。

col预选赛 cf：a 标记为删除或类似的东西，hbase shell cmd所不显示的？任何帮助将不胜感激。

谢谢！！

原文

I am stuck in a very odd situation related to Hbase design i would say.

Hbase version >> Version 2.1.0-cdh6.2.1

So, the problem statement is, in Hbase, we have a row in our table.
We perform new insert and then subsequent updates of the same Hbase row, as we receive the data from downstream.

say we received data like below
INSERT of {a=1,b=1,c=1,d=1,rowkey='row1'}
UPDATE of {b=1,c=1,d=1,rowkey='row1'}

and
say the final row is like this in our Hbase table

hbase(main):008:0> get 'test', 'row1'
COLUMN      CELL
cf:b        timestamp=1288380727188, value=value1
cf:c        timestamp=1288380727188, value=value1
cf:d        timestamp=1288380727188, value=value1
1 row(s) in 0.0400 seconds

So, cf:a, column qualifier is missing in above data as visible above when fetched via scan or get commands. But as per our ingestion flow/process, it should have been there. So, we are triaging as to where it went or what happened and so on. Still the analysis is in process and we are kind of clueless as to where it is.

Now, cut story short, we have a spark util to read the Hbase table into a Rdd, via
hbasecontext.hbaseRdd API function, convert it into a dataframe and display the tabular data. So, we ran this spark util on the same table to help locate this row and very surprisingly it returned 2 rows for the this same rowkey 'row1', where 1st row was the same as above get/scan row (above data) and the 2nd row had our missing column cf:a (surprising it had the same value which was expected). Say the output dataframe appeared something like below.

rowkey |cf:a |cf:b|cf:c|cf:d
row1   |null | 1  | 1  | 1         >> cf:a col qualifier missing (same as in Hbase shell)
row1   | 1   | 1  | 1  | 1         >> This cf:a was expected

We checked our Hbase table schema as well, so we dont have multiple versions of the cf:a in the describe or we dont do versioning on the table. The schema of the Hbase table describe has

VERSIONS => '1'

Anyways, i am clueless as to how hbaseRdd is able to read that row or missing col qualifier, but the Hbase shell cmds via get, scans does not read the missing col qualifier or row.
Any Hbase expert or suggestions please.

Fyi, i tried Hbase shell cmds as well via get - versions on the row, but it only returns the above get data and not the missing cf:a.

Is the col qualifier cf:a marked for deletion or something like that, which the Hbase shell cmd doesn't show ?
Any help would be appreciated.

Thanks !!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

注定孤独终老 2025-02-04 17:08:15

这是一个奇怪的问题，我怀疑这与在不同时间具有不同的列预选赛的puts有关。但是，我只是试图重新创建这种行为，但我似乎并没有遇到这个问题。但是我有一个常规的HBase 2.x构建，而不是您的。

我建议更仔细地探索问题的一种选择是在HBase Shell之外对HFILE进行物理检查。您可以使用HBASE HFILE实用程序在HFILE级别打印物理键值内容。显然，尝试在一个小的HFILE上做到这一点！不过，请不要忘记在桌子上进行冲洗和重新处理，因为HBase可以在内存中存储所有更新。

您可以在下面启动该实用程序，它将顺序打印所有键值：

hbase hfile -f hdfs://HDFS-NAMENODE:9000/hbase/data/default/test/29cfaecf083bff2f8aa2289c6a078678/f/09f569670678405a9262c8dfa7af8924 -p --printkv

在上述命令中，hdfs-nemenode是您的HDFS服务器，default是您的namespace（假设您没有），test是您的表名称，f是列家族名称。您可以通过递归使用HDFS浏览命令来找到HFILE的确切路径：

hdfs dfs -ls /hbase/data

This is a strange problem, which I suspect has to do with puts with the same rowkey having different column qualifiers at different times. However, I just tried to recreate this behaviour and I don't seem to be getting this problem. But I have a regular HBase 2.x build, as opposed to yours.

One option I would recommend to explore the problem more closely is to inspect the HFiles physically, outside of hbase shell. You can use the HBase HFile utility to print the physical key-value content at the HFile level. Obviously try to do this on a small HFile! Don't forget to flush and major-compact your table before you do it though, because HBase stores all updates in memory while it can.

You can launch the utility as below, and it will print all key-values sequentially:

hbase hfile -f hdfs://HDFS-NAMENODE:9000/hbase/data/default/test/29cfaecf083bff2f8aa2289c6a078678/f/09f569670678405a9262c8dfa7af8924 -p --printkv

In the above command, HDFS-NAMENODE is your HDFS server, default is your namespace (assuming you have none), test is your table name, and f is the column family name. You can find out the exact path to your HFiles by using the HDFS browse command recursively: