Help with database architecture for a 50+ GB DB
My task is to store a large amount of GPS data, plus some extra information, in a database and to access it for reporting and a few other infrequent tasks.
When I receive a message from a GPS device, it can have a variable number of fields. For example:
Message 1: DeviceId Lat Lon Speed Course DIO1 ADC1
Message 2: DeviceId Lat Course DIO2 IsAlarmOn
Message 3: DeviceId Lat Lon Height Course DIO2 IsAlarmOn etc. up to 20-30 fields
There is no way to unify the number of fields: different device vendors, different protocols, etc.
Another headache is the size of the database and the need to support as many DB vendors as possible (NHibernate is used).
So I came up with the idea of storing messages this way:
Table1 - Tracks
PK - TrackId
TrackStartTime
TrackEndTime
FirstMessageIndex(stores MessageId)
LastMessageIndex(stores MessageId)
DeviceId(not an FK)
Table2 - Messages
PK - MessageId
TimeStamp
FirstDataIndex(stores DataId)
LastDataIndex(stores DataId)
Table3 - MessageData
PK - DataId
double Data
short DataType
All index values are assigned with hilo. I tuned my queries so NHibernate can handle inserting 3000+k messages very quickly (batching is also used).
I'm happy with the performance at the moment, but I don't know how it will behave at 50+ GB or 100+ GB.
I would be very grateful for any tips and hints about my issue and the storage design overall =)
Thanks, Alexey
PS. Sorry for my English =)
Answers (3)
In a nutshell, your application, specifically the heterogeneous structure of the messages received from the GPS devices, pushes your design towards an EAV (Entity-Attribute-Value) datastore structure, whereby the Entity is the Message, the Attribute is "MessageData.DataType", and the Value is systematically a double.
The three-table design you outline in the question, however, seems to depart a bit from a traditional EAV implementation, in the sense that there is an implicit sequence to the way MessageData is stored: all the data points for a given message are sequentially numbered (DataId), and the link from a message to its data points is driven by a range of DataId values.
That is a bad idea!
There are many problems with that, a notable one being that it introduces an unnecessary bottleneck for the insertion of messages: you can't start inserting a second message until all data points for the previous message have been inserted.
Another issue is that it makes the relation between message and data point difficult to index (the underlying DBMS will not be efficient at it).
==> Suggestion: Make MessageId a foreign key in the MessageData table (and possibly drop the DataId PK in the MessageData table altogether, just to save space, at the expense of having to use a composite key to refer to a particular record in this table, for example for maintenance purposes).
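As a rough illustration of this suggestion, here is a minimal sketch using Python's built-in sqlite3 module. It is not the asker's NHibernate mapping: the DataType codes, the sample values, and the `fields_for_message` helper are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Messages (
    MessageId INTEGER PRIMARY KEY,
    DeviceId  INTEGER NOT NULL,
    TimeStamp TEXT    NOT NULL
);
-- The composite key (MessageId, DataType) replaces the DataId surrogate key.
CREATE TABLE MessageData (
    MessageId INTEGER NOT NULL REFERENCES Messages(MessageId),
    DataType  INTEGER NOT NULL,
    Data      REAL    NOT NULL,
    PRIMARY KEY (MessageId, DataType)
);
""")

# Hypothetical DataType codes, for illustration only.
LAT, LON, SPEED = 1, 2, 3

conn.execute("INSERT INTO Messages VALUES (1, 42, '2010-01-01T12:00:00')")
conn.executemany(
    "INSERT INTO MessageData VALUES (?, ?, ?)",
    [(1, LAT, 50.45), (1, LON, 30.52), (1, SPEED, 61.0)],
)

def fields_for_message(message_id):
    """Return {DataType: Data} for one message via the foreign key."""
    rows = conn.execute(
        "SELECT DataType, Data FROM MessageData WHERE MessageId = ?",
        (message_id,),
    )
    return dict(rows.fetchall())

fields = fields_for_message(1)
```

With this layout the data points of different messages can be inserted in any order, and the primary key doubles as the index for the message-to-data lookup.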
Another suggestion is to store the most common attributes (data points) at the level of the Messages table: for example Lat and Lon, but maybe also Course or some alarms. The reason for having this info right with the message is to optimize queries on the data (limiting the number of self-joins necessary with the MessageData table).
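A hedged sketch of what that could look like, again with sqlite3. The choice of Lat, Lon and Course as the promoted columns, the index name, and the `positions` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Messages (
    MessageId INTEGER PRIMARY KEY,
    DeviceId  INTEGER NOT NULL,
    TimeStamp TEXT    NOT NULL,
    Lat       REAL,   -- NULL when the device did not report the field
    Lon       REAL,
    Course    REAL
);
CREATE INDEX IX_Messages_Device_Time ON Messages (DeviceId, TimeStamp);
""")

conn.executemany(
    "INSERT INTO Messages VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 42, "2010-01-01T12:00:00", 50.45, 30.52, 90.0),
        (2, 42, "2010-01-01T12:00:10", 50.46, None, 92.0),  # no Lon reported
        (3, 7, "2010-01-01T12:00:05", 48.85, 2.35, 10.0),
    ],
)

def positions(device_id, start, end):
    """Complete positions for one device in a time window, with no joins."""
    return conn.execute(
        "SELECT TimeStamp, Lat, Lon FROM Messages "
        "WHERE DeviceId = ? AND TimeStamp BETWEEN ? AND ? "
        "AND Lat IS NOT NULL AND Lon IS NOT NULL "
        "ORDER BY TimeStamp",
        (device_id, start, end),
    ).fetchall()

track = positions(42, "2010-01-01T00:00:00", "2010-01-02T00:00:00")
```

A typical report (a track for one device over a period) then touches a single table, and only the rarer attributes need a join to the detail table.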
Since the Messages and MessageData tables would then each hold part of a message, you may also want to rename the latter to MessageDetail, or some such name.
Finally, it may be a good idea to allow for data values other than those of the double type. I anticipate that some of the alerts are merely boolean, etc. Aside from allowing you to accept different kinds of data points (say, short error message strings), this may also give you the opportunity to split the data points over multiple "detail" tables: one for doubles, one for booleans, one for strings, etc. This approach complicates the schema, in the sense that you then need to build some of these details into the way the queries are produced, but it offers some potential for performance and scaling gains.
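One possible shape for the split, sketched with sqlite3. The table names, the `detail_table_for` routing function, and the DataType codes are all hypothetical; they only show how the choice of table ends up in the code that builds the queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MessageDetailDouble (
    MessageId INTEGER NOT NULL, DataType INTEGER NOT NULL, Value REAL NOT NULL,
    PRIMARY KEY (MessageId, DataType));
CREATE TABLE MessageDetailBool (
    MessageId INTEGER NOT NULL, DataType INTEGER NOT NULL, Value INTEGER NOT NULL,
    PRIMARY KEY (MessageId, DataType));
CREATE TABLE MessageDetailString (
    MessageId INTEGER NOT NULL, DataType INTEGER NOT NULL, Value TEXT NOT NULL,
    PRIMARY KEY (MessageId, DataType));
""")

def detail_table_for(value):
    """Pick the detail table and the stored form for a Python value."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "MessageDetailBool", int(value)
    if isinstance(value, (int, float)):
        return "MessageDetailDouble", float(value)
    if isinstance(value, str):
        return "MessageDetailString", value
    raise TypeError(f"unsupported data point type: {type(value).__name__}")

def store(message_id, data_type, value):
    # The table name comes from the fixed set above, never from user input.
    table, stored = detail_table_for(value)
    conn.execute(
        f"INSERT INTO {table} VALUES (?, ?, ?)", (message_id, data_type, stored)
    )

store(1, 1, 50.45)          # e.g. Lat
store(1, 6, True)           # e.g. IsAlarmOn
store(1, 9, "low battery")  # e.g. a short error string

def count(table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

The extra routing step is the schema complication mentioned above: every reader and writer has to know which table a given attribute lives in.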
I'll try to describe how it works now in more detail in an answer, because comments have a fixed length =)
Here is the receive sequence:
1. The service receives messages from MSMQ (the number of messages can differ; at the moment it uses bulk packets of 500 messages).
2. It then extracts the distinct device IDs.
3. For each device ID it uses the MS EntLib isolated storage cache with the structure:
DeviceId --> List, where DeviceId is the lookup key.
4. If there are more than 1k messages in the cache, it writes them into the DB in one sequence and afterwards writes an "index" to the lookup table:
Index:
id
serial_id
index_start_datetime
index_end_datetime
index_first_dataid
index_last_dataid
5. It cleans the cache for this DeviceId.
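The five steps above can be sketched roughly as follows. This is a plain in-memory stand-in, not the MSMQ or EntLib code: the `DeviceBuffer` class and its `write_batch` callback are hypothetical names, and the callback is where the real service would write the messages and then the "index" row.

```python
from collections import defaultdict

FLUSH_THRESHOLD = 1000  # "more than 1k messages", per the description above

class DeviceBuffer:
    """Groups incoming messages per device and flushes large groups."""

    def __init__(self, write_batch, threshold=FLUSH_THRESHOLD):
        self._write_batch = write_batch  # hypothetical persistence callback
        self._threshold = threshold
        self._pending = defaultdict(list)  # DeviceId -> list of messages

    def receive(self, packet):
        """packet: iterable of (device_id, message) pairs from one bulk read."""
        touched = set()  # step 2: distinct device IDs in this packet
        for device_id, message in packet:
            self._pending[device_id].append(message)  # step 3: cache per device
            touched.add(device_id)
        for device_id in touched:
            if len(self._pending[device_id]) > self._threshold:
                # Steps 4 and 5: write the batch, then clear this device's cache.
                self._write_batch(device_id, self._pending.pop(device_id))

# Usage with a tiny threshold so the flush is visible.
written = []
buf = DeviceBuffer(lambda dev, msgs: written.append((dev, len(msgs))), threshold=3)
buf.receive([(1, "a"), (2, "b"), (1, "c")])
buf.receive([(1, "d"), (1, "e")])
```

Note that a sketch like this loses whatever is still buffered if the process stops, which is presumably why the original uses a persistent (isolated storage) cache.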
I also store data in pairs:
id data1 data2 type
for example lat lon, speed course, adc1 adc2, dio1 dio2
and if there is no paired value, the value is 0.
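A small sketch of this pairing, assuming field names and pair type labels that the post does not spell out (`PAIRS` and `to_pair_rows` are made up for the example):

```python
# Hypothetical pair definitions; the original post does not list type codes.
PAIRS = {
    "position": ("lat", "lon"),
    "motion": ("speed", "course"),
    "adc": ("adc1", "adc2"),
    "dio": ("dio1", "dio2"),
}

def to_pair_rows(fields):
    """Turn one decoded message into (data1, data2, type) rows.

    A missing half of a pair is stored as 0.0, as described above.
    Pairs with neither value present produce no row.
    """
    rows = []
    for pair_type, (first, second) in PAIRS.items():
        if first in fields or second in fields:
            rows.append(
                (fields.get(first, 0.0), fields.get(second, 0.0), pair_type)
            )
    return rows

rows = to_pair_rows({"lat": 50.45, "course": 90.0, "dio2": 1.0})
```

One consequence worth keeping in mind: with 0 as the filler, a reader of the table cannot tell a missing value from a genuine reading of zero (for example a speed of 0), so the protocol mapping has to carry that knowledge.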
I chose double because I can store every type of data the devices send in it.
They don't send strings; most of them send CSV-style data like 1,0,23,50.0000N30.00000,1,2,12,0,1,2 etc. Even alarms and the like have the same type of data.
When I need to get some data, I just find the indexes for the given datetime window and DeviceId and get the actual data, knowing where it starts and ends. There are no complex queries, just two simple ones. Other code interprets the data using some protocol "mappings".
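The two queries could look roughly like this in sqlite3. The lookup table is called `DataIndex` here because INDEX is a SQL keyword, `serial_id` is assumed to identify the device, and the sample rows and the `load` helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DataIndex (
    id INTEGER PRIMARY KEY,
    serial_id INTEGER NOT NULL,          -- assumed to be the device identifier
    index_start_datetime TEXT NOT NULL,
    index_end_datetime TEXT NOT NULL,
    index_first_dataid INTEGER NOT NULL,
    index_last_dataid INTEGER NOT NULL
);
CREATE TABLE MessageData (
    id INTEGER PRIMARY KEY, data1 REAL, data2 REAL, type INTEGER
);
""")
conn.executemany(
    "INSERT INTO MessageData VALUES (?, ?, ?, ?)",
    [(i, float(i), 0.0, 1) for i in range(1, 7)],
)
conn.executemany(
    "INSERT INTO DataIndex VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 42, "2010-01-01T10:00:00", "2010-01-01T11:00:00", 1, 3),
        (2, 7, "2010-01-01T10:30:00", "2010-01-01T11:30:00", 4, 6),
    ],
)

def load(serial_id, start, end):
    # Query 1: index rows for this device that overlap the time window.
    ranges = conn.execute(
        "SELECT index_first_dataid, index_last_dataid FROM DataIndex "
        "WHERE serial_id = ? AND index_start_datetime <= ? "
        "AND index_end_datetime >= ? ORDER BY index_start_datetime",
        (serial_id, end, start),
    ).fetchall()
    rows = []
    for first_id, last_id in ranges:
        # Query 2: a primary-key range scan over the data rows.
        rows += conn.execute(
            "SELECT id, data1, data2, type FROM MessageData "
            "WHERE id BETWEEN ? AND ? ORDER BY id",
            (first_id, last_id),
        ).fetchall()
    return rows

result = load(42, "2010-01-01T10:15:00", "2010-01-01T12:00:00")
```

The range scan returns the whole indexed block, so rows outside the requested window still have to be trimmed by the calling code. It also relies on each device's block occupying a contiguous ID range, which is the sequencing constraint discussed in the first answer.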
Thanks for the EAV tip. I think it fits well. The first table, Tracks, is for aggregating messages and getting them quickly in the retrieval algorithm I described a couple of lines earlier.
I'm writing a similar application. I suggest identifying all possible values from the vendors and creating a proper schema with all the necessary fields. Thanks to this you can write the simplest, best-performing reporting queries.
Besides, you can create fields with a specified length, which means you can save space and improve performance.
I have one vendor with known values, so I created one table for it.
This table can be easily partitioned using the native MS SQL Server mechanism.
So my simple situation allows me to write one stored procedure for saving data. There is no NHibernate there, just a plain ICommand.
The rest of the application uses NHibernate.