Database design for high-sample-rate data, graphed at multiple zoom levels

Posted 2024-10-12 04:22:27

I've got multiple sensors feeding data to my web app. Each channel produces 5 samples per second, and the data gets uploaded bundled together in 1-minute JSON messages (containing 300 samples).
The data will be graphed using flot at multiple zoom levels, from 1 day down to 1 minute.

I'm using Amazon SimpleDB and I'm currently storing the data in the 1-minute chunks in which I receive it. This works well for high zoom levels, but for full days there will simply be too many rows to retrieve.

The idea I've currently got is that every hour I can crawl through the data, collect together 300 samples for the last hour, and store them in another table, essentially down-sampling the data.

Does this sound like a reasonable solution? How have others implemented the same sort of systems?
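For reference, a minimal sketch of the hourly roll-up idea described above. The function and callback names are assumptions, and the SimpleDB reads/writes are left out; it only shows how one hour of raw 5 Hz data (18000 samples) could be reduced to a 300-sample hourly row.

```python
def downsample_hour(samples, target=300):
    """Reduce one hour of raw samples (5 Hz * 3600 s = 18000) to `target` points
    by averaging fixed-size blocks, so an hourly row has the same shape as a
    raw 1-minute row (300 samples)."""
    block = len(samples) // target          # 18000 // 300 = 60 samples per block
    return [
        sum(samples[i:i + block]) / block
        for i in range(0, block * target, block)
    ]

# Hypothetical hourly job: fetch the last 60 one-minute rows for a channel,
# concatenate their samples, downsample, and store a row in a separate hourly table.
def hourly_job(fetch_minute_rows, store_hourly_row, channel, hour_start):
    rows = fetch_minute_rows(channel, hour_start)      # 60 rows x 300 samples
    raw = [s for row in rows for s in row["samples"]]  # 18000 raw samples
    store_hourly_row(channel, hour_start, downsample_hour(raw))
```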

Comments (4)

一念一轮回 2024-10-19 04:22:27

Storing downsampled data is a perfectly fine approach.
Check out how munin stores its graphs - daily, monthly, yearly, and intraday graphs are stored separately there.

You may store data for each minute, each 5 minutes, each hour, each 4 hours, and each day in different tables. The overhead is very small compared to storing only per-minute data, and the benefit is large, since you never transmit data you don't need.
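A rough sketch of that cascade of resolutions. The bucket sizes and the choice of keeping (min, mean, max) per bucket are assumptions; how each resolution maps to a table is left to the storage layer.

```python
from statistics import mean

# Assumed resolutions, in seconds per bucket, one table per resolution.
RESOLUTIONS = {"1min": 60, "5min": 300, "1hour": 3600, "4hour": 14400, "1day": 86400}

def rollup(raw, samples_per_sec, bucket_seconds):
    """Collapse raw samples into (min, mean, max) per bucket, so coarse zoom
    levels are served from a small table instead of the raw per-minute data."""
    per_bucket = samples_per_sec * bucket_seconds
    return [
        (min(chunk), mean(chunk), max(chunk))
        for i in range(0, len(raw) - per_bucket + 1, per_bucket)
        for chunk in [raw[i:i + per_bucket]]
    ]

# e.g. one day of 5 Hz data -> 24 rows for the "1hour" table:
# hourly_rows = rollup(day_of_raw_samples, 5, RESOLUTIONS["1hour"])
```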

梦过后 2024-10-19 04:22:27

To speed things up, use a direct-organization model. It's the fastest way to store and retrieve data from files, and the implementation is simple enough that you don't need any framework or library.

The method is:

  1. you have to create an algorithm which converts the key into a contiguous record number (0..max. number of records),
  2. you have to use a fixed record size,
  3. the data is stored in flat files, where the record's position within the file is the record number (based on the key, as described in 1.) multiplied by the record size (see 2.).

Native data

You may create one data file per day for easier maintenance. Then your key is the number of the sample within the day, so your daily file will be 18000 * 24 * record size bytes. You should pre-create that file filled with 0s to make the operating system's life easier (it may not help much; it depends on the underlying filesystem/caching mechanism).

So, when a sample arrives, calculate the file position and write the record in its place.
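A minimal sketch of that scheme, assuming one file per day per channel and one 8-byte float per record; the record format, file naming, and helper names are assumptions, not part of the answer.

```python
import os
import struct

SAMPLES_PER_SEC = 5
RECORDS_PER_DAY = SAMPLES_PER_SEC * 60 * 60 * 24   # 432000 records per daily file
RECORD_FMT = "<d"                                   # one 8-byte float per record (assumed)
RECORD_SIZE = struct.calcsize(RECORD_FMT)

def preallocate(path):
    """Pre-create the daily file filled with zeros, as suggested above."""
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(b"\x00" * (RECORDS_PER_DAY * RECORD_SIZE))

def write_sample(path, second_of_day, sub_sample, value):
    """Key -> record number -> byte offset, then write the record in place."""
    record_no = second_of_day * SAMPLES_PER_SEC + sub_sample   # 0..RECORDS_PER_DAY-1
    with open(path, "r+b") as f:
        f.seek(record_no * RECORD_SIZE)
        f.write(struct.pack(RECORD_FMT, value))

def read_range(path, first_record, count):
    """Read `count` consecutive records starting at `first_record`."""
    with open(path, "rb") as f:
        f.seek(first_record * RECORD_SIZE)
        data = f.read(count * RECORD_SIZE)
    return [struct.unpack_from(RECORD_FMT, data, i * RECORD_SIZE)[0]
            for i in range(count)]
```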

Summarized data

You should store summarized data in direct files, too. These files will be much smaller: for 1-minute summary values there will be 24*60 = 1440 records per day.

There are some decisions you have to make:

  • the stepping of the zoom levels,
  • the stepping of the summarized data (it may not be worth collecting summarized data for every zoom step),
  • the organization of the summarized databases (the native data may be stored in daily files, but the daily data should be stored in monthly files).

Another thing to think about is when the summarized data gets created. While native data should be stored as it arrives, summarized data may be calculated at any time:

  • as the native data arrives (in this case a 1-minute summary record is updated 300 times, which is not optimal to write to disk immediately; the summing should be done in memory - see the sketch after this list);
  • by a background job that processes the native data periodically;
  • lazily, on demand.
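A sketch of the in-memory variant from the first bullet: accumulate the current minute in memory and write one summary record only when the minute completes. The class name, the (min, max, sum, count) summary shape, and the flush callback are assumptions.

```python
class MinuteSummarizer:
    """Accumulate raw samples in memory and emit one summary per completed
    minute, instead of rewriting the on-disk summary record for every sample."""

    def __init__(self, flush):
        self.flush = flush    # callback: flush(minute_index, summary_dict)
        self.minute = None
        self.acc = None

    def add(self, timestamp, value):
        minute = int(timestamp // 60)
        if minute != self.minute:
            self._emit()                      # previous minute is complete
            self.minute = minute
            self.acc = {"min": value, "max": value, "sum": 0.0, "count": 0}
        a = self.acc
        a["min"] = min(a["min"], value)
        a["max"] = max(a["max"], value)
        a["sum"] += value
        a["count"] += 1

    def _emit(self):
        if self.acc is not None:
            self.flush(self.minute, self.acc)
```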

Don't forget, not too many years ago these were exactly the questions of database design. I can promise one thing: it will be fast, faster than anything else (except keeping the data in memory).

夜未央樱花落 2024-10-19 04:22:27

I implemented this some time ago, with downsampling on the fly for some graphs. The drawback is that older data loses resolution, but I believe this is acceptable for you. And if you are interested in peaks, you can provide max, avg, and min values.

The algorithm isn't too hard either. If you have 5 samples per second and want to keep this granularity for, say, an hour, you have to store 5*60*60 = 18000 samples for that hour.

For the day you might go down to 1 sample every 5 seconds, reducing the amount by a factor of 25. The algorithm would then run every 5 seconds and calculate the median, min, and max of the 5 seconds that passed 24 hours ago. That results in 12*60*23 = 16560 more samples per day.

Going further back, I recommend one sample every minute, reducing the amount by another factor of 12 for maybe two weeks, so you have 60*24*13 = 18720 more samples for two weeks of data.
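A sketch of that kind of bucketed reduction, using the same figures as above (5 Hz raw data, 5-second windows for the factor-25 step). Where the reduced points get stored, and the function names, are assumptions.

```python
from statistics import median

def reduce_window(samples):
    """One output point per window: median plus min/max so peaks survive."""
    return {"median": median(samples), "min": min(samples), "max": max(samples)}

def downsample(raw, samples_per_sec=5, window_sec=5):
    """Turn 5 Hz raw data into one point per `window_sec` seconds, e.g. the
    factor-25 reduction applied to data older than 24 hours."""
    per_window = samples_per_sec * window_sec
    return [reduce_window(raw[i:i + per_window])
            for i in range(0, len(raw) - per_window + 1, per_window)]
```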

Special consideration should be given to how the data is stored in the DB. To get maximum performance you should ensure that one sensor's data is stored contiguously in one block of the database. If you use e.g. PostgreSQL, you know that one block is 8192 bytes long and that a record is never split across two blocks. Assuming one sample is 4 bytes long, and accounting for per-row overhead, I could fit 2048 minus a few samples in one block. At the highest resolution, that is roughly 2040 / 5 / 60 = 6 minutes of data. It might therefore be a good idea to always add 6 minutes at once, with perhaps 5 of them as dummy values that get updated in the later minutes, so queries can fetch a single sensor's blocks faster.

At the very least, I would use different tables for the different sensor granularities.

司马昭之心 2024-10-19 04:22:27

As of a few days ago, Amazon CloudWatch also lets you use custom metrics. If monitoring and alarming are your main concern, this may be helpful.
