Help with database architecture for a 50+ GB DB

Posted on 2024-08-08 13:09:32

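My task is to store a large amount of GPS data plus some additional information in a database, and to access it for reporting and a few other infrequent tasks.

When I receive a message from a GPS device, it can have a variable number of fields. For example:

Message 1: DeviceId Lat Lon Speed Course DIO1 ADC1
Message 2: DeviceId Lat Course DIO2 IsAlarmOn
Message 3: DeviceId Lat Lon Height Course DIO2 IsAlarmOn, and so on, up to 20-30 fields

There is no way to unify the number of fields: different device vendors, different protocols, etc. Another headache is the size of the database and the need to support as many DB vendors as possible (NHibernate is used).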

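So I came up with storing the messages this way:

Table 1 - Tracks
PK - TrackId
Track start time
Track end time
FirstMessageIndex (stores a MessageId)
LastMessageIndex (stores a MessageId)
DeviceId (not an FK)

Table 2 - Messages
PK - MessageId
TimeStamp
FirstDataIndex (stores a DataId)
LastDataIndex (stores a DataId)

Table 3 - MessageData
PK - DataId
double Data
short DataType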

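All indexes are assigned with hilo. I tuned my queries so that NHibernate can handle fast inserts of 3000+k messages (batching is also used). I'm happy with the performance at the moment, but I don't know how it will behave at 50+ GB or 100+ GB.

I would greatly appreciate any hints and tips regarding my problem and the storage design in general =)
Thanks, Alexey
P.S. Sorry for my English =)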

北斗星光 2024-08-15 13:09:32

In a nutshell, your application, specifically the heterogeneous structure of the messages received from the GPS devices, pushes your design towards an EAV (entity-attribute-value) datastore structure, whereby the Entity is the Message, the Attribute is "MessageData.DataType" and the Value is systematically a double.

The three-table design you outline in the question, however, seems to depart a bit from a traditional EAV implementation, in the sense that there is an implicit sequence to the way MessageData is stored: all the data points for a given message are sequentially numbered (DataId), and the link from a message to its data points is driven by a range of DataId values.
That is a bad idea!
There are many problems with it. A notable one is that it introduces an unnecessary bottleneck for the insertion of messages: you can't start inserting a second message until all the data points for the previous message have been inserted.
Another issue is that it makes the relation between a message and its data points difficult to index (the underlying DBMS will not be efficient at it).
==> Suggestion: make MessageId a foreign key in the MessageData table (and possibly drop the DataId PK in the MessageData table altogether, just to save space, at the expense of having to use a composite key to refer to a particular record in this table, for example for maintenance purposes).
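
To make the foreign-key suggestion concrete, here is a minimal sketch using Python's standard-library sqlite3 module. The table and column names, the data type codes, and the assumption of at most one value per (message, data type) are all illustrative, not the asker's actual schema. The point is only that each data row carries its own MessageId, so no first/last DataId range is needed and rows for different messages can be written in any order.

```python
import sqlite3

# Hypothetical data type codes; the real mapping depends on the device protocol.
LAT, LON, SPEED = 1, 2, 3

def create_schema(conn):
    """Create an EAV layout where each data row points back to its message."""
    conn.executescript("""
        CREATE TABLE message (
            message_id INTEGER PRIMARY KEY,
            device_id  INTEGER NOT NULL,
            timestamp  TEXT    NOT NULL
        );
        -- No surrogate DataId: the composite key identifies a row.
        CREATE TABLE message_data (
            message_id INTEGER NOT NULL REFERENCES message(message_id),
            data_type  INTEGER NOT NULL,
            value      REAL    NOT NULL,
            PRIMARY KEY (message_id, data_type)
        ) WITHOUT ROWID;
    """)

def insert_message(conn, message_id, device_id, timestamp, data):
    """Insert one message and its data points (data: {data_type: value})."""
    conn.execute(
        "INSERT INTO message VALUES (?, ?, ?)",
        (message_id, device_id, timestamp),
    )
    conn.executemany(
        "INSERT INTO message_data VALUES (?, ?, ?)",
        [(message_id, dtype, value) for dtype, value in data.items()],
    )

def data_for_message(conn, message_id):
    """Return {data_type: value} for one message via the key prefix."""
    rows = conn.execute(
        "SELECT data_type, value FROM message_data WHERE message_id = ?",
        (message_id,),
    )
    return dict(rows.fetchall())
```

Because the primary key starts with message_id, looking up the data points of a message is a plain key-prefix search rather than a range scan over DataId values stored in another table.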

Another suggestion is to store the most common attributes (data points) at the level of the Message table: for example Lat and Long, but maybe also Course or some alarms. The reason for having this info right with the message is to optimize queries on the data, by limiting the number of self-joins necessary on the MessageData table.
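
As a rough illustration of the self-join point, the sketch below (Python with sqlite3; all table and column names and type codes are assumptions) answers "which messages fall inside a bounding box" in two ways: once with Lat/Lon stored as columns on the message table, and once with Lat/Lon only in the EAV table, which needs one extra join of the detail table per attribute. The data is written to both places purely so the two queries can be compared.

```python
import sqlite3

LAT, LON = 1, 2  # hypothetical data type codes

def create_schema(conn):
    conn.executescript("""
        CREATE TABLE message (
            message_id INTEGER PRIMARY KEY,
            lat REAL,   -- common attributes stored inline, NULL when absent
            lon REAL
        );
        CREATE TABLE message_detail (
            message_id INTEGER NOT NULL REFERENCES message(message_id),
            data_type  INTEGER NOT NULL,
            value      REAL    NOT NULL,
            PRIMARY KEY (message_id, data_type)
        );
    """)

def add(conn, message_id, lat, lon):
    """Store the same position both inline and as EAV rows (for comparison only)."""
    conn.execute("INSERT INTO message VALUES (?, ?, ?)", (message_id, lat, lon))
    for dtype, value in ((LAT, lat), (LON, lon)):
        if value is not None:
            conn.execute(
                "INSERT INTO message_detail VALUES (?, ?, ?)",
                (message_id, dtype, value),
            )

def in_box_inline(conn, lat_min, lat_max, lon_min, lon_max):
    """Bounding-box query on inline columns: no joins at all."""
    rows = conn.execute(
        "SELECT message_id FROM message"
        " WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?"
        " ORDER BY message_id",
        (lat_min, lat_max, lon_min, lon_max),
    )
    return [r[0] for r in rows]

def in_box_eav(conn, lat_min, lat_max, lon_min, lon_max):
    """Same query on pure EAV data: one self-join per extra attribute."""
    rows = conn.execute(
        "SELECT a.message_id FROM message_detail AS a"
        " JOIN message_detail AS b ON b.message_id = a.message_id"
        " WHERE a.data_type = ? AND a.value BETWEEN ? AND ?"
        "   AND b.data_type = ? AND b.value BETWEEN ? AND ?"
        " ORDER BY a.message_id",
        (LAT, lat_min, lat_max, LON, lon_min, lon_max),
    )
    return [r[0] for r in rows]
```

Both functions return the same message ids; the difference is that every additional filtered attribute adds another join in the EAV form, while the inline form only adds another condition.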

Since the Messages and the MessageData tables would then each hold part of a message, you may also want to rename the latter to MessageDetail, or some such name.

Finally, it may be a good idea to allow for data values other than those of the double type. I anticipate that some of the alerts are merely boolean, etc. Aside from allowing you to accept different kinds of data points (say, short error message strings), this may also give you the opportunity to split the data points over multiple "detail" tables: one for doubles, one for booleans, one for strings, etc. This complicates the schema, in the sense that you then need to build some of these details into the way the queries are produced, but it can provide some potential for performance and scaling gains.
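
One way the typed "detail" tables could look is sketched below in Python with sqlite3. The table names (detail_double, detail_bool, detail_text) and the routing rule are assumptions made for illustration; they show the kind of logic that has to move into the code that produces inserts and queries.

```python
import sqlite3

# Hypothetical detail tables, one per value type, with the SQL column type.
DETAIL_TABLES = {
    "detail_double": "REAL",
    "detail_bool": "INTEGER",
    "detail_text": "TEXT",
}

def create_detail_tables(conn):
    for table, sql_type in DETAIL_TABLES.items():
        conn.execute(
            f"CREATE TABLE {table} ("
            f" message_id INTEGER NOT NULL,"
            f" data_type INTEGER NOT NULL,"
            f" value {sql_type} NOT NULL,"
            f" PRIMARY KEY (message_id, data_type))"
        )

def route(value):
    """Pick the detail table for a value and the form in which it is stored."""
    # bool must be tested first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "detail_bool", int(value)
    if isinstance(value, (int, float)):
        return "detail_double", float(value)
    if isinstance(value, str):
        return "detail_text", value
    raise TypeError(f"unsupported data point type: {type(value).__name__}")

def insert_detail(conn, message_id, data_type, value):
    """Insert one data point into the table matching its type; return the table name."""
    table, stored = route(value)
    conn.execute(
        f"INSERT INTO {table} VALUES (?, ?, ?)",
        (message_id, data_type, stored),
    )
    return table
```

The table name is taken from a fixed dictionary rather than from user input, so formatting it into the SQL string is safe here; the values themselves are always bound as parameters.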

哆兒滾 2024-08-15 13:09:32

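I'll try to describe how it works in more detail in an answer, since comments have a fixed length =)
Here is the receive sequence:
1. The service receives messages from MSMQ (the number of messages can vary; at the moment it uses batch packets of 500 messages).
2. It then determines the distinct device IDs.
3. For each device ID it uses the MS EntLib isolated storage cache, with the structure:
DeviceId --> List, where DeviceId is the lookup key.
4. If there are more than 1k messages in the cache, it writes them to the DB sequentially and then writes an "index" row to a lookup table:
Index:
id
serial number
index_start_datetime
index_end_datetime
index_first_dataid
index_last_dataid
5. It clears the cache for this DeviceId.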
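
The receive sequence above can be sketched in plain Python as follows. This is an in-memory stand-in, not the MSMQ or EntLib code: the DeviceBuffer class, its parameter names and the write_batch callback are hypothetical, and the callback is assumed to do both the sequential database write and the "index" row insert from step 4.

```python
from collections import defaultdict

class DeviceBuffer:
    """Buffers messages per device and flushes a device once it exceeds a threshold."""

    def __init__(self, flush_threshold, write_batch):
        self.flush_threshold = flush_threshold
        # write_batch(device_id, messages) is a hypothetical callback that would
        # write the messages sequentially and then store the index row.
        self.write_batch = write_batch
        self._pending = defaultdict(list)  # device_id -> list of messages

    def receive(self, batch):
        """Handle one incoming packet: an iterable of (device_id, message) pairs."""
        touched = set()
        for device_id, message in batch:
            self._pending[device_id].append(message)
            touched.add(device_id)  # step 2: distinct device ids in this packet
        for device_id in touched:
            if len(self._pending[device_id]) > self.flush_threshold:
                # Steps 4 and 5: write everything, then clear this device's cache.
                self.write_batch(device_id, self._pending.pop(device_id))

    def pending_count(self, device_id):
        return len(self._pending.get(device_id, ()))
```

With flush_threshold=1000 this mirrors the "more than 1k messages" rule; a real service would also need a flush on shutdown or on a timer, which is left out here.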

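In addition, I store the data in pairs:
id data1 data2 type
For example lat lon, speed course, adc1 adc2, dio1 dio2.
If there is no paired value, the value is 0.

I chose double because I can store every type of data the devices send.
No strings are sent; most of the messages are CSV-style, e.g. 1,0,23,50.0000N30.00000,1,2,12,0,1,2. Even alarms and the like have the same kind of data.
When I need to get some data, I just find the index for the given datetime window and DeviceId, and fetch the actual data knowing where it starts and ends. There are no complex queries, just 2 simple ones. Other code uses a protocol "map" to interpret the data.
Thanks for the EAV hint. I think it fits well. The first table, Track, is used to aggregate messages and fetch them quickly with the retrieval algorithm I described a few lines above.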
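
The two-query retrieval described in this post might look roughly like the sketch below (Python with sqlite3). The table names idx and data, the ISO 8601 text timestamps, and the reading of the "serial number" column as the device identifier are assumptions. Note that it returns every stored block that overlaps the requested window, so the caller may still need to trim rows at the edges.

```python
import sqlite3

def create_schema(conn):
    conn.executescript("""
        CREATE TABLE idx (
            id                   INTEGER PRIMARY KEY,
            serial               INTEGER NOT NULL,  -- assumed to identify the device
            index_start_datetime TEXT    NOT NULL,  -- ISO 8601 text sorts chronologically
            index_end_datetime   TEXT    NOT NULL,
            index_first_dataid   INTEGER NOT NULL,
            index_last_dataid    INTEGER NOT NULL
        );
        CREATE TABLE data (
            id    INTEGER PRIMARY KEY,
            data1 REAL    NOT NULL,
            data2 REAL    NOT NULL,  -- 0 when there is no paired value
            type  INTEGER NOT NULL
        );
    """)

def fetch_window(conn, serial, start, end):
    """Return data rows of all index blocks overlapping [start, end] for one device."""
    # Query 1: find the index rows whose time span overlaps the window.
    ranges = conn.execute(
        "SELECT index_first_dataid, index_last_dataid FROM idx"
        " WHERE serial = ? AND index_start_datetime <= ? AND index_end_datetime >= ?"
        " ORDER BY index_start_datetime",
        (serial, end, start),
    ).fetchall()
    rows = []
    # Query 2 (once per block): read the contiguous id range of that block.
    for first_id, last_id in ranges:
        rows.extend(
            conn.execute(
                "SELECT id, data1, data2, type FROM data"
                " WHERE id BETWEEN ? AND ? ORDER BY id",
                (first_id, last_id),
            ).fetchall()
        )
    return rows
```

This only works as long as each block of a device occupies a contiguous id range, which is what the sequential write in the receive sequence is meant to guarantee.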