What is the most efficient way to store measurements with a variable number of fields in a database?
We have a data collection system that collects measurements from environmental sensors that measure the velocity of water flowing through a river or channel. Each measurement generates a fixed number of values (e.g. Date, Time, Temperature, Pressure etc.) plus a list of velocity values.
Originally the sensors supplied three velocity values, so I simply stored each value in its own column of a single table in a FireBird database. Later on, sensors were introduced that could output up to nine velocity values, so I simply added six more columns. Even though most sensors use fewer than nine values, I reckoned it would not be a problem if most of the columns just contained zeroes.
But now I'm facing a new generation of sensors that can output anything from 1 to 256 values, and I assume it will not be very efficient to add another 247 columns, especially since most of the measurements will still only contain 3 to 9 values.
Since the measurements are collected every 10 minutes, and the database contains all data for 30 to 50 sensors, the total amount of data is quite significant after a few years, yet it must be possible to generate overviews/graphs for any arbitrary period of time.
So what would be the most efficient way to store the variable list of values?
Since each record has its own unique ID, I assume I could just store all velocity values in a separate table, each value tagged with its record ID. I just have the feeling that this would not be very efficient and that it would get very slow after a while.
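For a sense of scale, the figures in the question can be turned into a rough row count. The sketch below assumes the upper end of 50 sensors and that every sensor reports every 10 minutes all year; the function name is only illustrative. Under those assumptions there are about 2.6 million measurements per year, and a separate value table would grow by roughly 8 to 24 million rows per year when each measurement carries 3 to 9 values.

```python
# Rough row-count estimate from the figures in the question.
SENSORS = 50              # assumption: upper end of "30 to 50 sensors"
INTERVAL_MINUTES = 10     # one measurement every 10 minutes

measurements_per_day = SENSORS * (24 * 60 // INTERVAL_MINUTES)
measurements_per_year = measurements_per_day * 365


def velocity_rows_per_year(values_per_measurement: int) -> int:
    """Rows added per year if every velocity value is stored as its own row."""
    return measurements_per_year * values_per_measurement


print(measurements_per_year)        # measurements per year
print(velocity_rows_per_year(3))    # value rows per year with 3 values each
print(velocity_rows_per_year(9))    # value rows per year with 9 values each
```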
Comments (3)
Databases can handle large amounts of data in a table if you use efficient indexes, so you can use a table structure with one row per velocity value. Create an index on `id`, on `id, seq`, and on `ts`. That will allow you to search efficiently through the data. If you distrust your database, just insert a few million rows and run a couple of selects to see how well it fares. For comparison: I have an Oracle database here with 112 million rows and I can select a record by timestamp or ID within 120 ms (0.12 s).
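The "insert rows and time a select" test could be sketched as follows. This is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in for Firebird, and the table names, the `sensor` and `velocity` columns, the index name, and the synthetic data are all assumptions. Only `id`, `seq`, and `ts` come from the answer. Absolute timings on SQLite say little about Firebird, so the same idea should be repeated against the real database.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measurements (
        id INTEGER PRIMARY KEY,   -- primary key is indexed automatically
        sensor INTEGER NOT NULL,
        ts INTEGER NOT NULL       -- Unix timestamp of the measurement
    );
    CREATE TABLE velocities (
        id INTEGER NOT NULL REFERENCES measurements(id),
        seq INTEGER NOT NULL,     -- position of the value within the measurement
        velocity REAL NOT NULL,
        PRIMARY KEY (id, seq)     -- doubles as the (id, seq) index
    );
    CREATE INDEX idx_measurements_ts ON measurements (ts);
""")


def populate(n_measurements: int, values_each: int = 9) -> None:
    """Insert synthetic measurements spaced 600 seconds apart."""
    conn.executemany(
        "INSERT INTO measurements (id, sensor, ts) VALUES (?, ?, ?)",
        ((i, i % 50, i * 600) for i in range(n_measurements)),
    )
    conn.executemany(
        "INSERT INTO velocities (id, seq, velocity) VALUES (?, ?, ?)",
        ((i, s, 0.1 * s) for i in range(n_measurements)
         for s in range(1, values_each + 1)),
    )
    conn.commit()


def timed_range_query(ts_from: int, ts_to: int):
    """Fetch all velocity rows in a time window and report elapsed seconds."""
    start = time.perf_counter()
    rows = conn.execute(
        """SELECT m.ts, v.seq, v.velocity
           FROM measurements m JOIN velocities v ON v.id = m.id
           WHERE m.ts BETWEEN ? AND ?
           ORDER BY m.ts, v.seq""",
        (ts_from, ts_to),
    ).fetchall()
    return rows, time.perf_counter() - start


populate(20_000)
rows, elapsed = timed_range_query(0, 600 * 143)   # the first 144 measurements
print(len(rows), f"{elapsed:.4f}s")
```

Raising the argument to `populate` to a few million gives a more realistic picture of how the indexes hold up.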
You could save the serialized data in a text field, for example by JSON-encoding the list of velocity values.
Then, in your code, deserialize the values after querying.
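A minimal sketch of that round trip, assuming Python with the standard `json` and `sqlite3` modules (SQLite stands in for Firebird here). The table layout, the `save` and `load` helpers, and the sample readings are made up for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements ("
    " id INTEGER PRIMARY KEY, ts TEXT NOT NULL,"
    " temperature REAL, velocities TEXT NOT NULL)"
)


def save(ts, temperature, velocities):
    """Store one measurement; the velocity list goes into a single text column."""
    conn.execute(
        "INSERT INTO measurements (ts, temperature, velocities) VALUES (?, ?, ?)",
        (ts, temperature, json.dumps(velocities)),
    )


def load(ts_from, ts_to):
    """Filter on a regular column, then deserialize the list in application code."""
    cur = conn.execute(
        "SELECT ts, velocities FROM measurements"
        " WHERE ts BETWEEN ? AND ? ORDER BY ts",
        (ts_from, ts_to),
    )
    return [(ts, json.loads(blob)) for ts, blob in cur]


# Made-up sample readings with different numbers of velocity values.
save("2024-01-01T00:00:00", 4.5, [0.31, 0.28, 0.35])
save("2024-01-01T00:10:00", 4.6, [0.30, 0.29, 0.33, 0.36, 0.40])
result = load("2024-01-01T00:00:00", "2024-01-01T23:59:59")
print(result)
```

The query never looks inside the `velocities` column, which is what keeps this approach simple.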
This should work well if you only filter your queries by the other fields, and not by the saved values. If you do filter by the values, using them in WHERE clauses will be a nightmare.
I'd go with a second table, Velocity. `Velocity.MeasurementId` references `Measurements.Id`, and `Velocity.Sequence` is the index of the velocity value for that measurement (1-256). Populate these tables with data as close to real-world as possible and test the SQL statements to find the best indexes.
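One way the two tables might look, sketched with Python's `sqlite3` module as a stand-in for Firebird. Only `Measurements.Id`, `Velocity.MeasurementId`, and `Velocity.Sequence` come from the answer; the `Ts` and `Speed` columns, the helper functions, and the sample data are assumptions, and the types and constraint syntax would need adapting to Firebird.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only on request
conn.executescript("""
    CREATE TABLE Measurements (
        Id INTEGER PRIMARY KEY,
        Ts TEXT NOT NULL                   -- hypothetical timestamp column
    );
    CREATE TABLE Velocity (
        MeasurementId INTEGER NOT NULL REFERENCES Measurements(Id),
        Sequence INTEGER NOT NULL CHECK (Sequence BETWEEN 1 AND 256),
        Speed REAL NOT NULL,               -- hypothetical name for the value column
        PRIMARY KEY (MeasurementId, Sequence)
    );
""")


def add_measurement(ts, velocities):
    """Insert one measurement plus one Velocity row per value, numbered from 1."""
    measurement_id = conn.execute(
        "INSERT INTO Measurements (Ts) VALUES (?)", (ts,)
    ).lastrowid
    conn.executemany(
        "INSERT INTO Velocity (MeasurementId, Sequence, Speed) VALUES (?, ?, ?)",
        [(measurement_id, seq, v) for seq, v in enumerate(velocities, start=1)],
    )
    return measurement_id


def velocities_of(measurement_id):
    """Return the velocity values of one measurement in sequence order."""
    rows = conn.execute(
        "SELECT Speed FROM Velocity WHERE MeasurementId = ? ORDER BY Sequence",
        (measurement_id,),
    )
    return [speed for (speed,) in rows]


# Made-up sample data: a 3-value sensor and a 12-value sensor share the same tables.
first = add_measurement("2024-01-01T00:00:00", [0.31, 0.28, 0.35])
second = add_measurement("2024-01-01T00:10:00", [0.2] * 12)
print(velocities_of(first), len(velocities_of(second)))
```

The composite primary key gives each measurement's values a fixed order and lets the database find them without scanning the whole table, whether a sensor reports 3 values or 256.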