Hive/HDFS for large-scale, real-time storage of sensor data?
I am evaluating sensor data collection systems with the following requirements:
- 1 million endpoints sending in 100 bytes of data every minute (as a time series).
Basically millions of small writes to the storage.
This data is write-once, so basically it never gets updated.
- Access requirements
a. The full data for a user needs to be accessed periodically (less frequently).
b. Partial data for a user needs to be accessed periodically (more frequently). For example, I need the sensor data collected over the last hour/day/week/month for analysis/reporting.
I have started looking at Hive/HDFS as an option. Can someone comment on the applicability of Hive in such a use case? I am concerned that, while it would meet the distributed storage needs, it seems better suited to data warehousing applications than to real-time data collection/storage.
Do HBase/Cassandra make more sense in this scenario?
Comments (2)
I think HBase can be a good option for you. In fact, there is already an open-source implementation on top of HBase that solves a similar problem, and you might want to use it. Take a look at OpenTSDB. Here's a short excerpt from their blurb:
There are actually quite a few people collecting sensor data in a time-series fashion with Cassandra. It's a very good fit. I recommend you read this article on basic time series in Cassandra for an idea of what your data model would be like.
Writes in Cassandra are extremely cheap, so even a moderately sized cluster could easily handle one million writes per minute.
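As a rough sanity check on that load, here is a back-of-the-envelope calculation using the figures from the question. It assumes each of the 1 million endpoints sends one 100-byte sample per minute, and it counts raw payload only; replication, compression, and per-column storage overhead are ignored, so real numbers will differ.

```python
# Back-of-the-envelope ingest estimate from the figures in the question.
# Assumptions: one 100-byte sample per endpoint per minute, raw payload only
# (no replication factor, compression, or storage overhead).
ENDPOINTS = 1_000_000
BYTES_PER_SAMPLE = 100
SAMPLES_PER_MINUTE = 1


def ingest_estimate(endpoints=ENDPOINTS,
                    bytes_per_sample=BYTES_PER_SAMPLE,
                    samples_per_minute=SAMPLES_PER_MINUTE):
    """Return the write rate and raw data volume for the given fleet."""
    writes_per_minute = endpoints * samples_per_minute
    bytes_per_minute = writes_per_minute * bytes_per_sample
    return {
        "writes_per_second": writes_per_minute / 60,
        "bytes_per_minute": bytes_per_minute,
        "bytes_per_day": bytes_per_minute * 60 * 24,
    }


est = ingest_estimate()
print(f"writes/second: {est['writes_per_second']:.0f}")
print(f"MB/minute:     {est['bytes_per_minute'] / 1e6:.0f}")
print(f"GB/day:        {est['bytes_per_day'] / 1e9:.0f}")
```

Under those assumptions the cluster has to absorb roughly 17,000 small writes per second and about 144 GB of raw data per day.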
Both of your read queries could be answered very efficiently. For the second type of query, where you're reading data for a slice of time for a single sensor, you would end up reading a contiguous slice from a single row; this should take about 10ms for a completely cold read. For the first type of query, you would simply run several of the per-sensor queries in parallel. Assuming you store a basic map of users to sensor IDs, you would look up all of the sensor IDs for a user with one query, and then your second query would fetch the data for all of those sensors (although you might break up this query if the number of sensors is high).
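To make the access pattern concrete, here is a minimal in-memory sketch of that layout in plain Python. It is not Cassandra code and uses no Cassandra API: the `SensorStore` class and its method names are made up for illustration. Each sensor gets one "row" whose entries are kept sorted by timestamp, so a time-range read is a contiguous slice, and a per-user read is a lookup in a user-to-sensor map followed by one slice per sensor.

```python
import bisect
from collections import defaultdict


class SensorStore:
    """Hypothetical in-memory stand-in for one time-sorted row per sensor."""

    def __init__(self):
        self._timestamps = defaultdict(list)   # sensor_id -> sorted timestamps
        self._values = defaultdict(list)       # sensor_id -> values, same order
        self._user_sensors = defaultdict(set)  # user_id -> sensor ids

    def register(self, user_id, sensor_id):
        """Record that a sensor belongs to a user."""
        self._user_sensors[user_id].add(sensor_id)

    def write(self, sensor_id, ts, value):
        """Insert one sample, keeping the sensor's row sorted by timestamp."""
        ts_list = self._timestamps[sensor_id]
        idx = bisect.bisect_right(ts_list, ts)
        ts_list.insert(idx, ts)
        self._values[sensor_id].insert(idx, value)

    def read_slice(self, sensor_id, start_ts, end_ts):
        """Query type (b): samples with start_ts <= ts < end_ts for one sensor."""
        ts_list = self._timestamps.get(sensor_id, [])
        values = self._values.get(sensor_id, [])
        lo = bisect.bisect_left(ts_list, start_ts)
        hi = bisect.bisect_left(ts_list, end_ts)
        return list(zip(ts_list[lo:hi], values[lo:hi]))

    def read_user(self, user_id, start_ts, end_ts):
        """Query type (a): look up the user's sensors, then slice each one.

        Done sequentially here; a real client could issue these in parallel.
        """
        sensor_ids = sorted(self._user_sensors.get(user_id, ()))
        return {sid: self.read_slice(sid, start_ts, end_ts) for sid in sensor_ids}


store = SensorStore()
store.register("alice", "s1")
store.register("alice", "s2")
for minute in range(5):
    store.write("s1", minute * 60, f"reading-{minute}")
store.write("s2", 120, "other-reading")

recent = store.read_slice("s1", 120, 300)
per_user = store.read_user("alice", 0, 180)
print(recent)
print(per_user)
```

The half-open time range is a choice made for this sketch so that adjacent windows (for example, consecutive hours) do not overlap.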
Hive and HDFS don't really make sense when you're talking about real-time queries, as they're more suited for long-running batch jobs.