Using HBase to store time-series data
We are trying to use HBase to store time-series data. Our current model stores each time series as versions within a single cell. This means a cell could end up holding millions of versions, and queries on the series would retrieve a range of versions using the setTimeRange method on HBase's Get class.
For example:
{
  "row1" : {
    "columnFamily1" : {
      "column1" : {
        1 : "1",
        2 : "2"
      },
      "column2" : {
        1 : "1"
      }
    }
  }
}
Is this a reasonable model for storing time-series data in HBase?
Would an alternative model, storing the data across multiple columns (is it possible to query across columns?) or across multiple rows, be more suitable?
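To make the model in the question concrete, here is a minimal in-memory sketch of one cell's versions and a time-range read. It is not the HBase client API: the class `VersionedCell` and its methods are hypothetical names, and a `TreeMap` stands in for the cell. The one HBase detail it mirrors is that a time range includes its minimum timestamp and excludes its maximum.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory model of the versions of one cell (hypothetical, not the HBase client API).
final class VersionedCell {
    // timestamp -> value, kept sorted by timestamp
    private final NavigableMap<Long, String> versions = new TreeMap<>();

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Half-open range like a time-range query: minStamp inclusive, maxStamp exclusive.
    NavigableMap<Long, String> timeRange(long minStamp, long maxStamp) {
        return versions.subMap(minStamp, true, maxStamp, false);
    }

    public static void main(String[] args) {
        VersionedCell column1 = new VersionedCell();
        column1.put(1L, "1");
        column1.put(2L, "2");
        column1.put(3L, "3");
        // Only the versions with timestamps 1 and 2 fall inside [1, 3).
        System.out.println(column1.timeRange(1L, 3L));
    }
}
```

Against a real table, the equivalent read would also have to ask for more than the latest version, since a Get returns only the newest version of a cell unless told otherwise.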
Comments (2)
I don't think you should use versioning to store the time series here. Not because it won't work, but because versioning wasn't designed for this use case, and there are other ways to do it.
I suggest you store the time series in a single row, with the time step as the column qualifier and the data itself as the value.
One nice thing here is that HBase stores column qualifiers in sorted order, so when you read the time series back you should see the items in order.
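The ordering is by raw bytes, so how the time step is encoded matters. The sketch below is an in-memory model, not the HBase client API: `WideRowSeries` and the 8-byte big-endian qualifier encoding are assumptions chosen for illustration, and the encoding only sorts correctly for non-negative time steps.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Comparator;
import java.util.TreeMap;

// In-memory model of one wide row (hypothetical, not the HBase client API).
final class WideRowSeries {
    // Compare qualifiers the way a byte-sorted store would: unsigned, lexicographic.
    private static final Comparator<byte[]> UNSIGNED_BYTES = Arrays::compareUnsigned;

    // qualifier bytes -> value
    private final TreeMap<byte[], String> columns = new TreeMap<>(UNSIGNED_BYTES);

    // Assumed encoding: 8-byte big-endian, valid for non-negative steps only.
    static byte[] qualifier(long timeStep) {
        return ByteBuffer.allocate(Long.BYTES).putLong(timeStep).array();
    }

    static long timeStep(byte[] qualifier) {
        return ByteBuffer.wrap(qualifier).getLong();
    }

    void put(long timeStep, String value) {
        columns.put(qualifier(timeStep), value);
    }

    // Time steps in the order the sorted qualifiers are stored.
    long[] stepsInStoredOrder() {
        return columns.keySet().stream().mapToLong(WideRowSeries::timeStep).toArray();
    }

    public static void main(String[] args) {
        WideRowSeries series = new WideRowSeries();
        series.put(10L, "b");
        series.put(9L, "a");
        series.put(104L, "c");
        // Inserted out of order, read back in numeric order.
        System.out.println(Arrays.toString(series.stepsInStoredOrder()));
    }
}
```

A fixed-width binary encoding is used here because decimal strings of different lengths would not sort numerically ("10" sorts before "9").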
Another realistic option is to make the series identifier the first part of the row key and append the time step, so that each time step gets its own row.
This has the nice feature that range scans within a particular series become easy. For example, pulling out steps 104 to 199 of fooseries is trivial to implement and efficient.
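As a rough illustration of that range scan, the sketch below models the table as a sorted map of row keys. Nothing here is the HBase client API: `TallSeriesTable`, the `-` separator, and the 10-digit zero padding are assumptions. The scan takes an inclusive start and an exclusive stop, so steps 104 to 199 are requested as 104 up to 200.

```java
import java.util.Locale;
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory model of a table with one row per time step (hypothetical, not the HBase client API).
final class TallSeriesTable {
    // row key -> value; for ASCII keys, String order matches byte order.
    private final TreeMap<String, String> rows = new TreeMap<>();

    // Assumed key layout: "<seriesId>-<step zero-padded to 10 digits>".
    static String rowKey(String seriesId, long step) {
        return String.format(Locale.ROOT, "%s-%010d", seriesId, step);
    }

    void put(String seriesId, long step, String value) {
        rows.put(rowKey(seriesId, step), value);
    }

    // Range scan: startStep inclusive, stopStep exclusive.
    SortedMap<String, String> scan(String seriesId, long startStep, long stopStep) {
        return rows.subMap(rowKey(seriesId, startStep), rowKey(seriesId, stopStep));
    }

    public static void main(String[] args) {
        TallSeriesTable table = new TallSeriesTable();
        table.put("fooseries", 103, "a");
        table.put("fooseries", 104, "b");
        table.put("fooseries", 199, "c");
        table.put("fooseries", 200, "d");
        table.put("barseries", 150, "x");
        // Steps 104 to 199 of fooseries; rows of other series are never touched.
        System.out.println(table.scan("fooseries", 104, 200).keySet());
    }
}
```

The zero padding is what keeps the keys in numeric order; without it, step 1000 would sort before step 200.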
The downside of this approach is that deleting an entire series requires a bit more management and synchronization. Another downside is that MapReduce jobs will have a harder time analyzing this data. With the first approach, the entire time series is passed to a single map() call, while here map() is called once for each frame.
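With one row per frame, a job that needs a whole series has to put the frames back together itself, typically by keying each frame on its series identifier and reassembling downstream. The sketch below shows only that regrouping step in plain Java, with no MapReduce classes involved. `FrameRegrouping` is a hypothetical name, and it assumes row keys of the form "<seriesId>-<zero-padded step>" in which the last `-` separates the identifier from the step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Regroups one-row-per-frame data by series (hypothetical helper, plain Java).
final class FrameRegrouping {
    // Input: row key -> value, iterated in key order so frames stay in step order.
    static Map<String, List<String>> groupBySeries(Map<String, String> tallRows) {
        Map<String, List<String>> bySeries = new TreeMap<>();
        for (Map.Entry<String, String> row : tallRows.entrySet()) {
            // Each iteration corresponds to one per-frame map() call.
            String key = row.getKey();
            // Assumes every key contains '-' before the step.
            String seriesId = key.substring(0, key.lastIndexOf('-'));
            bySeries.computeIfAbsent(seriesId, id -> new ArrayList<>()).add(row.getValue());
        }
        return bySeries;
    }

    public static void main(String[] args) {
        Map<String, String> rows = new TreeMap<>();
        rows.put("fooseries-0000000104", "a");
        rows.put("fooseries-0000000105", "b");
        rows.put("barseries-0000000001", "x");
        System.out.println(groupBySeries(rows));
    }
}
```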
+1 for OpenTSDB. It uses many tricks to simplify time-based rollup queries.
As for the original question, you can have as many cell versions as you want (there is no hard limit, though the column family's maximum-versions setting must be raised to retain them). There is no performance penalty: a Get is implemented as a Scan in HBase anyway, and setTimeRange is quite an effective filter.