HBase schema help
Coming from a SQL Server background, I'm a newbie with regard to HBase, but the technology looks to be a good fit for what we're doing and the cost is definitely right!
I need to maintain a list of log entries, which I would normally create in an RDBMS as:
create table Log
(
UserID int, SiteID int, Page varchar(50), Date smalldatetime
)
where one user may have anywhere from 0 to 1,000 rows in this simple table. Typical queries would be to find all the rows for one user, or all the rows for one user on one site.
How does this translate into a "map" in HBase, where there is no "row key" and the same (SiteID, Page) may appear many times? My first thought is that UserID is the row key, but I still don't understand "column families" and the other terminology well enough to see how to set up the table to hold this data, where one UserID can have many (SiteID, Page, Date) "rows".
Any direction is appreciated!
Comments (3)
My suggestion would be to use your UserId as the row key and a single column family, since adding column families unnecessarily only increases the time taken for seeks. Use siteId|date as the column qualifier so that it is always unique, and store the page as the value of that qualifier.
Hope it works!
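To make the suggested layout concrete, here is a minimal sketch in plain Python. It is not HBase client code: the nested dictionary only stands in for one table with a single column family, and the names `log_table`, `put_entry`, `entries_for_user` and `entries_for_user_site`, as well as the sample data, are made up for illustration.

```python
from collections import defaultdict

# Toy in-memory stand-in for one HBase table with a single column family:
# row key (UserID) -> column qualifier ("siteId|date") -> value (page)
log_table = defaultdict(dict)

def put_entry(user_id, site_id, date, page):
    """Store one log entry as a column in the user's row."""
    qualifier = f"{site_id}|{date}"
    log_table[user_id][qualifier] = page

def entries_for_user(user_id):
    """All log entries for a user: a lookup of a single row."""
    return sorted(log_table.get(user_id, {}).items())

def entries_for_user_site(user_id, site_id):
    """Entries for one user on one site: keep qualifiers with the site prefix."""
    prefix = f"{site_id}|"
    return [(q, p) for q, p in entries_for_user(user_id) if q.startswith(prefix)]

# Hypothetical sample data
put_entry(42, 7, "2013-08-12T00:00", "/home")
put_entry(42, 7, "2013-08-12T00:05", "/about")
put_entry(42, 9, "2013-08-12T00:10", "/home")
```

In this model both typical queries touch only one row: the first reads every column of the row, the second reads only the columns whose qualifier starts with the site ID.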
Initially, just look at it as
to represent - 12_Aug_2013_00:00 : -Temp=24, -Humidity=15, -FlightsDelayed=17
Now look a little deeper: what if we group the qualifiers into column families?
e.g.:
Let's group No_FlightsDelayed and No_FlightsCancelled as eventsConts.
We have WeatherDetails and eventsConts as column families.
We have - Date_Hour : WeatherDetails : EventDetails:
e.g., the data recorded for the first hour of 12_August_2013 could be represented as
This grouping is there to optimize the fetch operation.
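One way to picture the grouping is as a nested map of row key, column family, qualifier and value. The sketch below is a plain-Python toy, not HBase code; the helper `fetch` is hypothetical, and the value for No_FlightsCancelled is invented because the answer does not give one.

```python
# Toy model: row key -> column family -> qualifier -> value.
weather_table = {
    "12_Aug_2013_00:00": {
        "WeatherDetails": {"Temp": 24, "Humidity": 15},
        # 17 comes from the answer; 3 is a made-up illustrative value
        "eventsConts": {"No_FlightsDelayed": 17, "No_FlightsCancelled": 3},
    }
}

def fetch(table, row_key, family=None):
    """Return a whole row, or only one column family of that row."""
    row = table.get(row_key, {})
    if family is None:
        return row
    return {family: row.get(family, {})}

# Reading only the weather columns leaves the event counters untouched.
weather_only = fetch(weather_table, "12_Aug_2013_00:00", "WeatherDetails")
```

The point of the grouping shows up in `fetch`: a reader that only needs the weather figures asks for one family and never looks at the other.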
One approach would be to make compound row keys out of your userid+siteid.
Set the table to keep however many log entries you want for a given page, and store your data as a new version each time (manually setting the timestamp if necessary).
Since HBase maintains a timestamp for each cell, you don't need a separate column for the access time.
You would thus have a table with contents something like
To deal with your two example requests:
For finding all of a user's rows, you would do a scan (be sure to set maxVersions) from userx:0 to userx+1:0, and then parse the site IDs out of each result row.
To get all pages for a specific user/site, just do a scan from userx:sitex to userx:sitex+1. Last I checked, you can't set maxVersions on a get, so that isn't an option.
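The two scans can be sketched with a small in-memory model. This is plain Python rather than the HBase client API: `ToyTable`, `row_key` and the sample data are hypothetical, and the zero-padded key format is an assumption made so that the lexicographic order of the keys matches the numeric order of the IDs. As in HBase, the scan includes the start row and excludes the stop row.

```python
import bisect

def row_key(user_id, site_id):
    """Compound key; zero-padded so string order matches numeric order (assumption)."""
    return f"{user_id:08d}:{site_id:08d}"

class ToyTable:
    """Rows kept sorted by key; each row holds timestamped versions of one page cell."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> list of (timestamp, page)

    def put(self, key, timestamp, page):
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = []
        self._rows[key].append((timestamp, page))

    def scan(self, start_row, stop_row, max_versions=1):
        """Yield rows with start_row <= key < stop_row, newest versions first."""
        lo = bisect.bisect_left(self._keys, start_row)
        hi = bisect.bisect_left(self._keys, stop_row)
        for key in self._keys[lo:hi]:
            versions = sorted(self._rows[key], reverse=True)[:max_versions]
            yield key, versions

table = ToyTable()
table.put(row_key(5, 1), 1000, "/home")
table.put(row_key(5, 1), 2000, "/about")
table.put(row_key(5, 2), 1500, "/home")
table.put(row_key(6, 1), 1700, "/home")

# All rows for user 5: scan from user5:0 up to, but not including, user6:0.
user_rows = list(table.scan(row_key(5, 0), row_key(6, 0), max_versions=100))
# One user on one site: scan from user5:site1 up to user5:site2.
site_rows = list(table.scan(row_key(5, 1), row_key(5, 2), max_versions=100))
```

Leaving `max_versions` at its default of 1 would return only the most recent page per user/site row, which is why the answer stresses setting it on the scan.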
To put it simply, column families represent groups of data that you want stored together.
Presumably you would often be reading the data in a family at the same time. Placing columns in separate families results in the data being stored separately, so you get faster reads when you only want one column, but you need to read from two different places to get both columns.
Of course, depending on your other needs, you may want to take a different approach. I would strongly recommend reading the Bigtable paper to better understand the structure of HBase (since it is closely based on Bigtable).
To better understand the internals of HBase, Lars George's blog is also great.