Database design for high-sample-rate data, graphing at multiple zoom levels
I've got multiple sensors feeding data to my web app. Each channel is 5 samples per second and the data gets uploaded bundled together in 1 minute json messages (containing 300 samples).
The data will be graphed using flot at multiple zoom levels from 1 day to 1 minute.
I'm using Amazon SimpleDB and I'm currently storing the data in the 1-minute chunks that I receive it in. This works well for high zoom levels, but for full days there will simply be too many rows to retrieve.
The idea I've currently got is that every hour I can crawl through the data and collect together 300 samples for the last hour and store them in another table, essentially down-sampling the data.
Does this sound like a reasonable solution? How have others implemented the same sort of systems?
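For concreteness, a minimal sketch of that hourly job, assuming samples are (timestamp, value) pairs and that averaging is an acceptable aggregate; the 60:1 ratio just follows from 18000 raw samples per hour going down to 300:

```python
from statistics import mean

def downsample_hour(samples, out_size=300):
    """Reduce one hour of raw samples (18000 at 5 Hz) to out_size
    averaged points for a coarser table. `samples` is a sorted
    list of (timestamp, value) tuples."""
    bucket = len(samples) // out_size  # 18000 // 300 = 60 raw samples per point
    return [(samples[i][0], mean(v for _, v in samples[i:i + bucket]))
            for i in range(0, bucket * out_size, bucket)]
```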
Comments (4)
Storing downsampled data is a perfectly fine approach.
Check out how munin stores its graphs - daily, monthly, yearly and intraday graphs are stored separately there.
You may store data for each minute, each 5 minutes, each hour, each 4 hours, and each day in different tables. The overhead is very small in comparison to storing only per-minute data, and the benefit is large, since you never transmit what you don't need.
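As a sketch of that tier scheme, here is one way each resolution could be rolled up from the next finer one; the table names and the choice of averaging are assumptions:

```python
from statistics import mean

# Resolution tiers, finest first: (table name, seconds per point).
TIERS = [("samples_1min", 60), ("samples_5min", 300),
         ("samples_1h", 3600), ("samples_4h", 14400), ("samples_1d", 86400)]

def roll_up(points, fine_step, coarse_step):
    """Aggregate sorted (timestamp, value) points of one tier into the
    next coarser tier by averaging coarse_step // fine_step points."""
    n = coarse_step // fine_step
    return [(points[i][0], mean(v for _, v in points[i:i + n]))
            for i in range(0, len(points) - n + 1, n)]
```

Each tier's rollup job only needs to read the tier directly below it, so the finest table is scanned once per rollup interval.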
Speed up the database: use the direct organization model. It's the fastest method to store/retrieve data in files, and the implementation is so simple that you don't need any framework or library.
The method is:
Native data
You may create one data file per day for easier maintenance. Your key is then the number of the sample within the day, so your daily file will be 18000 * 24 * record size bytes. You should pre-create that file filled with 0s in order to make the operating system's life easier (it may not help much; that depends on the underlying filesystem/caching mechanism).
So, when a sample arrives, calculate the file position and write the record to its place.
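A minimal sketch of that scheme, assuming one little-endian double per record; the record layout and helper names are illustrative:

```python
import struct

RECORD = struct.Struct("<d")         # one fixed-size record: a little-endian double
SAMPLES_PER_DAY = 5 * 60 * 60 * 24   # 432000 samples at 5 Hz

def create_day_file(path):
    """Pre-create the daily file filled with zeros so every offset exists."""
    with open(path, "wb") as f:
        f.write(b"\x00" * (SAMPLES_PER_DAY * RECORD.size))

def write_sample(path, sample_no, value):
    """Direct organization: the key (sample number within the day)
    maps straight to a byte offset - no index needed."""
    with open(path, "r+b") as f:
        f.seek(sample_no * RECORD.size)
        f.write(RECORD.pack(value))

def read_samples(path, first, count):
    """Read a contiguous range of samples in one seek."""
    with open(path, "rb") as f:
        f.seek(first * RECORD.size)
        data = f.read(count * RECORD.size)
    return [RECORD.unpack_from(data, i * RECORD.size)[0] for i in range(count)]
```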
Summarized data
You should store summarized data in direct files, too. These files will be much smaller. In the case of 1-minute summarized values there will be 24*60 = 1440 records per daily file.
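The same direct layout works for the summary file; a sketch with per-minute min/avg/max records (the three-field record format is an assumption):

```python
import struct

SUMMARY = struct.Struct("<ddd")   # min, avg, max for one minute
MINUTES_PER_DAY = 24 * 60         # 1440 summary records per daily file

def write_minute_summary(path, minute_no, values):
    """Summarize one minute of raw samples and store the result at its
    fixed position in the (pre-created, zero-filled) daily summary file."""
    record = SUMMARY.pack(min(values), sum(values) / len(values), max(values))
    with open(path, "r+b") as f:
        f.seek(minute_no * SUMMARY.size)
        f.write(record)
```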
There are some decisions you have to make:
Another thing to think about is the creation time of the summarized data. While native data should be stored just as it arrives, summarized data may be calculated at any time:
Don't forget, not too many years ago these issues were the database design issues. I can promise one thing: it will be fast, faster than anything else (except using memory for storing the data).
I implemented this some time ago with downsampling on the fly for some graphs. The drawback is that older data loses resolution, but I believe this is acceptable for you. And if you are interested in peaks, you could provide max, avg, and min values.
The algorithm isn't too hard either. If you have 5 samples per second and want to keep this granularity for, say, an hour, you have to store 5*60*60 = 18000 samples for that hour.
For the day you might go down to 1 sample every 5 seconds, reducing the amount by a factor of 25. The algorithm would then run every 5 seconds and calculate the median, min, and max of the 5-second window that just passed out of the full-resolution hour. That results in 12*60*23 = 16560 more samples per day. If you store data further back, I recommend one sample per minute, reducing the amount by a factor of 12, for maybe two weeks, so you have 60*24*13 = 18720 more samples for two weeks of data.
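A sketch of one step of that job, assuming raw (timestamp, value) samples; the function name is a placeholder:

```python
from statistics import median

def summarize_window(samples):
    """Collapse the 5-second window that just left the full-resolution
    hour into one (median, min, max) point for the coarser tier."""
    values = [v for _, v in samples]
    return median(values), min(values), max(values)

# Example: 25 raw samples (5 s at 5 Hz) become a single coarser point.
window = [(t, t % 7) for t in range(25)]
med, lo, hi = summarize_window(window)
```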
Special consideration should be given to storing the data in the DB. To get maximum performance you should ensure the data of one sensor is stored in one block in the database. If you use e.g. PostgreSQL, you know that one block is 8192 bytes long and that no record is stored across two blocks. Assuming one sample is 4 bytes long, and considering the per-row overhead, I could fit 2048 minus a few samples in one block. Given the highest resolution, that is 2040 / 5 / 60 ≈ 6.8, roughly 6 minutes of data. It MIGHT be a good idea now to always add 6 minutes at once, maybe 5 of them just dummies to be updated in the later minutes, so the queries can fetch the blocks of a single sensor faster.
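One way to realize that layout, as a sketch: one row per sensor per 6-minute window, pre-inserted with a placeholder array and updated in place minute by minute. The schema, DSN, and psycopg2 usage here are assumptions, not from the original setup:

```python
import psycopg2

conn = psycopg2.connect("dbname=sensors")  # placeholder DSN
conn.autocommit = True                     # sketch: skip transaction handling
cur = conn.cursor()

# One row holds a 6-minute window (1800 samples at 5 Hz), so a whole
# window of one sensor tends to live in a single 8 kB block.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sensor_windows (
        sensor_id    integer,
        window_start timestamptz,
        samples      real[],            -- 1800 slots, dummies at first
        PRIMARY KEY (sensor_id, window_start)
    )""")

def start_window(sensor_id, window_start):
    # Pre-insert the full-size array of dummy values up front ...
    cur.execute("INSERT INTO sensor_windows VALUES (%s, %s, %s)",
                (sensor_id, window_start, [0.0] * 1800))

def fill_minute(sensor_id, window_start, minute, values):
    # ... and overwrite one minute (300 samples) in place as it arrives.
    cur.execute(
        "UPDATE sensor_windows SET samples[%s:%s] = %s "
        "WHERE sensor_id = %s AND window_start = %s",
        (minute * 300 + 1, (minute + 1) * 300, values, sensor_id, window_start))
```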
At the very least I would use different tables for the different sensor granularities.
As of a few days ago, Amazon CloudWatch allows you to use custom metrics as well. If monitoring and alarming is your main concern, this may be helpful.