Hive/HDFS for large-scale, real-time storage of sensor data?

Posted 2024-12-22 02:00:30 · 406 words · 6 views · 0 comments

I am evaluating sensor data collection systems with the following requirements:

  1. 1 million endpoints each send 100 bytes of data every minute (as a time series).
  2. This amounts to millions of small writes to storage. The data is write-once, so it is essentially never updated.
  3. Access requirements
    a. The full data for a user needs to be accessed periodically (less frequently).
    b. Partial data for a user needs to be accessed periodically (more frequently). For example, I need the sensor data collected over the last hour/day/week/month for analysis/reporting.
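As a rough sizing check on these numbers, here is a back-of-the-envelope sketch. It counts raw payload only; replication, row keys, and storage-engine overhead are not included, and the helper name `points_in_window` is just for illustration.

```python
# Figures taken from the requirements above.
ENDPOINTS = 1_000_000
BYTES_PER_SAMPLE = 100
SAMPLES_PER_MINUTE = 1

# Sustained write rate if endpoints are spread evenly over each minute.
writes_per_second = ENDPOINTS * SAMPLES_PER_MINUTE / 60

# Raw payload volume per day (60 minutes * 24 hours).
bytes_per_day = ENDPOINTS * BYTES_PER_SAMPLE * SAMPLES_PER_MINUTE * 60 * 24


def points_in_window(hours, sensors=1):
    """Number of data points a read over `hours` touches for `sensors` endpoints."""
    return sensors * SAMPLES_PER_MINUTE * 60 * hours


print(f"{writes_per_second:,.0f} writes/s")           # about 16,667 writes/s
print(f"{bytes_per_day / 1e9:.0f} GB/day raw")        # 144 GB/day raw
print(points_in_window(1), points_in_window(24), points_in_window(24 * 7))
```

So the per-sensor reads are small (60 points for an hour, 1,440 for a day, 10,080 for a week); the hard part is the sustained stream of small writes.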

I have started looking at Hive/HDFS as an option. Can someone comment on the applicability of Hive to such a use case? I am concerned that, while it would meet the distributed storage needs, it seems better suited to data warehousing applications than to real-time data collection/storage.

Do HBase/Cassandra make more sense in this scenario?


Comments (2)

诺曦 2024-12-29 02:00:30

I think HBase can be a good option for you. In fact, there is already an open-source implementation on top of HBase that solves a similar problem and that you might want to use: take a look at OpenTSDB. Here's a short excerpt from their blurb:

OpenTSDB is a distributed, scalable Time Series Database (TSDB)
written on top of HBase. OpenTSDB was written to address a common
need: store, index and serve metrics collected from computer systems
(network gear, operating systems, applications) at a large scale, and
make this data easily accessible and graphable. Thanks to HBase's
scalability, OpenTSDB allows you to collect many thousands of metrics
from thousands of hosts and applications, at a high rate (every few
seconds). OpenTSDB will never delete or downsample data and can easily
store billions of data points. As a matter of fact, StumbleUpon uses
it to keep track of hundred of thousands of time series and collects
over 600 million data points per day in their main production
datacenter.
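To give an idea of what feeding such a system looks like, here is a minimal sketch that formats one sensor reading in OpenTSDB's line-based `put` format (`put <metric> <timestamp> <value> <tagk=tagv> ...`). The metric name `sensor.reading` and the `user`/`sensor` tags are assumptions made up for this example, not something OpenTSDB prescribes, and the helper `format_put` is hypothetical.

```python
def format_put(metric, timestamp, value, tags):
    """Build one OpenTSDB-style `put` line for a single data point.

    `tags` is a dict of tag name -> tag value; a data point needs at least one tag.
    """
    if not tags:
        raise ValueError("a data point needs at least one tag")
    # Sort tags so the same point always produces the same line.
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {int(timestamp)} {value} {tag_str}"


# Hypothetical metric and tag names, for illustration only.
line = format_put("sensor.reading", 1356998400, 42.5, {"user": "u1", "sensor": "s17"})
print(line)  # put sensor.reading 1356998400 42.5 sensor=s17 user=u1
```

Whether one tag value per sensor is workable at a million endpoints depends on how your OpenTSDB deployment handles tag cardinality, so treat the tagging scheme here as a starting point to test rather than a recommendation.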

奈何桥上唱咆哮 2024-12-29 02:00:30

There are actually quite a few people collecting sensor data in a time-series fashion with Cassandra. It's a very good fit. I recommend you read this article on basic time series in Cassandra for an idea of what your data model would be like.

Writes in Cassandra are extremely cheap, so even a moderately sized cluster could easily handle one million writes per minute.

Both of your read queries could be answered very efficiently. For the second type of query, where you're reading data for a slice of time for a single sensor, you would end up reading a contiguous slice from a single row; this should take about 10 ms for a completely cold read. For the first type of query, you would simply run several of the per-sensor queries in parallel. Assuming you store a basic map of users to sensor IDs, you would look up all of the sensor IDs for a user with one query, and then your second query would fetch the data for all of those sensors (although you might break up this query if the number of sensors is high).
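The access pattern described above can be sketched with a small in-memory stand-in. This is not Cassandra's API: the class `SensorStore` and its methods are hypothetical, and it only illustrates the layout (one time-ordered row per sensor, plus a user-to-sensor-IDs map). It also reads the sensors sequentially, where a real client would issue the per-sensor queries in parallel.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class SensorStore:
    """In-memory model: one time-ordered 'row' per sensor, plus a user -> sensors map."""

    def __init__(self):
        # sensor_id -> (sorted timestamps, payloads in the same order)
        self._rows = defaultdict(lambda: ([], []))
        self._user_sensors = defaultdict(set)

    def register(self, user_id, sensor_id):
        self._user_sensors[user_id].add(sensor_id)

    def write(self, sensor_id, ts, payload):
        # Write-once: points are only ever inserted, kept sorted by timestamp.
        times, values = self._rows[sensor_id]
        i = bisect_right(times, ts)
        times.insert(i, ts)
        values.insert(i, payload)

    def read_slice(self, sensor_id, start, end):
        """Contiguous slice of one sensor's row: points with start <= ts < end."""
        times, values = self._rows.get(sensor_id, ([], []))
        lo = bisect_left(times, start)
        hi = bisect_left(times, end)
        return list(zip(times[lo:hi], values[lo:hi]))

    def read_user(self, user_id, start, end):
        """Query 1: look up the user's sensor IDs. Query 2: one slice per sensor."""
        sensor_ids = sorted(self._user_sensors.get(user_id, ()))
        return {sid: self.read_slice(sid, start, end) for sid in sensor_ids}


store = SensorStore()
store.register("alice", "s1")
store.register("alice", "s2")
for minute in range(5):
    store.write("s1", minute * 60, f"r{minute}".encode())
store.write("s2", 120, b"x")

print(store.read_slice("s1", 120, 300))  # the last three readings of s1
print(store.read_user("alice", 0, 130))  # both sensors, first ~two minutes
```

Because each sensor's points are stored in time order, a time-range read is a single contiguous slice, which is what makes the "last hour/day/week" queries cheap.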

Hive and HDFS don't really make sense when you're talking about real-time queries, as they're more suited for long-running batch jobs.
