使用周期性传感器数据设计数据库

发布于 2024-10-20 20:56:33 字数 2501 浏览 3 评论 0原文

我正在设计一个 PostgreSQL 数据库，它可以从许多传感器源获取读数。我对设计做了很多研究，正在寻找一些新的输入来帮助我摆脱困境。

需要明确的是，我并不是在寻求描述数据源或任何相关元数据的帮助。我特别想弄清楚如何最好地存储数据值（最终是各种类型）。

输入数据的基本结构如下：

对于每个数据记录设备，有多个通道。
对于每个通道，记录器读取数据并将其附加到带有时间戳的记录。
不同的通道可能具有不同的数据类型，但通常 float4 就足够了。
用户应该（通过数据库功能）能够添加不同的值类型，但这个问题是次要的。
记录器和通道也将通过功能添加。

这种数据布局的显着特征是，我有许多通道将数据点与带有时间戳和索引号的单个记录相关联。

现在，描述数据量和常见访问模式：

每分钟大约有 5 个记录器传入数据，每个记录器有 48 个通道。
- 本例中的总数据量为每天 345,600 次读数，每年 1.26 亿次，并且这些数据至少需要在未来 10 年持续读取。
更多记录器和未来将添加通道，可能来自物理上不同类型的设备，但希望具有类似的存储表示。
常见访问将包括跨所有记录器查询类似的通道类型以及跨记录器时间戳加入。例如，从 logger1 获取通道 1，从 logger2 获取通道 4，并在 logger1.time = logger2.time 上执行完整外连接。

我还应该提到，每个记录器时间戳都会因时间调整而发生变化，并将在不同的表中进行描述，显示服务器的时间读数、记录器的时间读数、传输延迟、时钟调整以及最终调整后的时钟值。根据检索，这将发生在一组记录器记录/时间戳上。这是我使用下面的 RecordTable 的动机，但只要我可以从某处引用（记录器、时间、记录）行来更改关联数据的时间戳，那么现在就不用太担心了。

我考虑了相当多的模式选项，最简单的类似于混合 EAV 方法，其中表本身描述属性，因为大多数属性只是一个称为“值”的真实值。这是一个基本布局：

RecordTable          DataValueTable
----------           --------------
[PK] id        <--   [FK] record_id
[FK] logger_id       [FK] channel_id
record_number        value
logger_time

考虑到 logger_id、record_number 和 logger_time 是唯一的，我想我在这里使用代理键，但希望我的节省空间的理由在这里是有意义的。我还考虑过向 DataValueTable 添加 PK id（而不是 record_id 和 channel_id 的 PK），以便引用其他表中的数据值，但我现在正在努力抵制让这个模型“过于灵活”的冲动。然而，我确实希望尽快开始让数据流动，并且当稍后需要添加额外的功能或不同结构的数据时不必更改这部分。

起初，我为每个记录器创建记录表，然后为每个通道创建值表，并在其他地方（在一个地方）描述它们，并使用视图将它们连接起来，但这感觉“错误”，因为我重复了同样的事情，所以很多次。我想我正在尝试在太多表和太多行之间找到一个折衷方案，但是对更大的数据（DataValueTable）进行分区似乎很奇怪，因为我很可能在 channel_id 上进行分区，因此每个分区的每一行都具有相同的值。此外，在这方面进行分区需要在每次添加通道时重新定义主表中的检查条件。按日期分区仅适用于 RecordTable，考虑到它相对较小（每天 7200 行，有 5 个记录器），这实际上并不是必要的。

我还考虑在 channel_id 上使用上面的部分索引，因为 DataValueTable 会变得非常大，但通道 ID 集将保持很小，但我真的不确定多年后这将能够很好地扩展。我已经用模拟数据做了一些基本测试，性能只是马马虎虎，我希望它随着数据量的增长而保持出色。此外，有些人对清理和分析大型表以及处理大量索引（在本例中最多 250 个）表示担忧。

在一个非常小的旁注中，我还将跟踪此数据的更改并允许注释（例如，一只鸟拉在传感器上，因此这些值被调整/标记等），因此在考虑时请记住这一点这里的设计，但目前这是一个单独的问题。

关于我的经验/技术水平的一些背景，如果有助于了解我来自哪里：我是一名计算机科学博士生，作为我研究的一部分，我定期使用数据/数据库。然而，我在为客户（这是业务的一部分）设计一个具有超长寿命和灵活数据表示的强大数据库的实践经验有些有限。我认为我现在的主要问题是我正在考虑解决这个问题的所有角度，而不是专注于完成它，而且我根本看不到摆在我面前的“正确”解决方案。

总之，我想这些是我对你的主要询问：如果你做过这样的事情，什么对你有用？我在这里提出的各种设计没有看到哪些优点/缺点？考虑到这些参数和访问模式，您如何设计这样的东西？

我很乐意在需要时提供澄清/详细信息，并提前感谢您的出色表现。

原文

I'm designing a PostgreSQL database that takes in readings from many sensor sources. I've done a lot of research into the design and I'm looking for some fresh input to help get me out of a rut here.

To be clear, I am not looking for help describing the sources of data or any related metadata. I am specifically trying to figure out how to best store data values (eventually of various types).

The basic structure of the data coming in is as follows:

For each data logging device, there are several channels.
For each channel, the logger reads data and attaches it to a record with a timestamp
Different channels may have different data types, but generally a float4 will suffice.
Users should (through database functions) be able to add different value types, but this concern is secondary.
Loggers and channels will also be added through functions.

The distinguishing characteristic of this data layout is that I've got many channels associating data points to a single record with a timestamp and index number.

Now, to describe the data volume and common access patterns:

Data will be coming in for about 5 loggers, each with 48 channels, for every minute.
- The total data volume in this case will be 345,600 readings per day, 126 million per year, and this data needs to be continually read for the next 10 years at least.
More loggers & channels will be added in the future, possibly from physically different types of devices but hopefully with similar storage representation.
Common access will include querying similar channel types across all loggers and joining across logger timestamps. For example, get channel1 from logger1, channel4 from logger2, and do a full outer join on logger1.time = logger2.time.

I should also mention that each logger timestamp is something that is subject to change due to time adjustment, and will be described in a different table showing the server's time reading, the logger's time reading, transmission latency, clock adjustment, and resulting adjusted clock value. This will happen for a set of logger records/timestamps depending on retrieval. This is my motivation for RecordTable below but otherwise isn't of much concern for now as long as I can reference a (logger, time, record) row from somewhere that will change the timestamps for associated data.

I have considered quite a few schema options, the most simple resembling a hybrid EAV approach where the table itself describes the attribute, since most attributes will just be a real value called "value". Here's a basic layout:

RecordTable          DataValueTable
----------           --------------
[PK] id        <--   [FK] record_id
[FK] logger_id       [FK] channel_id
record_number        value
logger_time

Considering that logger_id, record_number, and logger_time are unique, I suppose I am making use of surrogate keys here but hopefully my justification of saving space is meaningful here. I have also considered adding a PK id to DataValueTable (rather than the PK being record_id and channel_id) in order to reference data values from other tables, but I am trying to resist the urge to make this model "too flexible" for now. I do, however, want to start getting data flowing soon and not have to change this part when extra features or differently-structured-data need to be added later.

At first, I was creating record tables for each logger and then value tables for each channel and describing them elsewhere (in one place), with views to connect them all, but that just felt "wrong" because I was repeating the same thing so many times. I guess I'm trying to find a happy medium between too many tables and too many rows, but partitioning the bigger data (DataValueTable) seems strange because I'd most likely be partitioning on channel_id, so each partition would have the same value for every row. Also, partitioning in that regard would require a bit of work in re-defining the check conditions in the main table every time a channel is added. Partitioning by date is only applicable to the RecordTable, which isn't really necessary considering how relatively small it will be (7200 rows per day with the 5 loggers).

I also considered using the above with partial indexes on channel_id since DataValueTable will grow very large but the set of channel ids will remain small-ish, but I am really not certain that this will scale well after many years. I have done some basic testing with mock data and the performance is only so-so, and I want it to remain exceptional as data volume grows. Also, some express concern with vacuuming and analyzing a large table, and dealing with a large number of indexes (up to 250 in this case).

On a very small side note, I will also be tracking changes to this data and allowing for annotations (e.g. a bird crapped on the sensor, so these values were adjusted/marked etc), so keep that in the back of your mind when considering the design here but it is a separate concern for now.

Some background on my experience/technical level, if it helps to see where I'm coming from: I am a CS PhD student, and I work with data/databases on a regular basis as part of my research. However, my practical experience in designing a robust database for clients (this is part of a business) that has exceptional longevity and flexible data representation is somewhat limited. I think my main problem now is I am considering all the angles of approach to this problem instead of focusing on getting it done, and I don't see a "right" solution in front of me at all.

So In conclusion, I guess these are my primary queries for you: if you've done something like this, what has worked for you? What are the benefits/drawbacks I'm not seeing of the various designs I've proposed here? How might you design something like this, given these parameters and access patterns?

I'll be happy to provide clarification/details where needed, and thanks in advance for being awesome.

分享到QQ

分享到微博