Database structure recommendations for a large data set

Posted 2024-10-08 18:29:08


It seems to me this question has no precise answer, since it would require overly complex analysis and a deep dive into the details of our system.

We have a distributed network of sensors. Information is gathered in one database and processed further.

The current DB design is one huge table partitioned per month. We try to keep it at around 1 billion rows (usually 600-800 million records), so the fill rate is 20-50 million records per day.

The DB server is currently MS SQL 2008 R2, but we started on 2005 and upgraded during project development.

The table itself contains SensorId, MessageTypeId, ReceiveDate and Data fields. The current solution is to keep the sensor data in the Data field (binary, 16-byte fixed length), partially decode its type and store that in messageTypeId.

We have different kinds of message types sent by the sensors (currently approx. 200), and the number can be increased further on demand.

Main processing is done on an application server which fetches records on demand (by type, sensorId and date range), decodes them and carries out the required processing. The current speed is sufficient for this amount of data.

We have a request to increase the capacity of our system by 10-20 times, and we worry whether our current solution is capable of that.

We also have two ideas to "optimise" the structure which I want to discuss.

1 Sensor data can be split into types; I'll use 2 primary ones for simplicity: (value) level data (analog data with a range of values) and state data (a fixed set of values).

So we could redesign our table into a bunch of small ones using the following rules:

  • for each fixed-type value (state type), create its own table with SensorId and ReceiveDate (so we avoid storing the type and the binary blob); all dependent (extended) states will be stored in their own tables, similar to a foreign key, so if we have a State with values A and B, and dependent (or additional) states 1 and 2 for it, we end up with tables StateA_1, StateA_2, StateB_1, StateB_2. So the table name consists of the fixed states it represents.

  • for each analog data type we create a separate table; it will be similar to the first kind but contains an additional field with the sensor value (see the DDL sketch right after this list).
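
For illustration only, a minimal DDL sketch of what these rules could produce (StateA_1 is taken from the example above; Level_Temperature and the column types are hypothetical, they only show the shape of the split):

-- Hypothetical sketch of idea 1 (names are illustrative, not the real schema).
-- One narrow table per fixed state combination: a row's presence is the value,
-- so neither messageTypeId nor the binary blob has to be stored.
CREATE TABLE [StateA_1](
    [sensorId]    [int]      NOT NULL,
    [receiveDate] [datetime] NOT NULL
)

-- One table per analog (level) data type: the same key columns plus the decoded
-- sensor value, sized for that type instead of a 16-byte blob.
CREATE TABLE [Level_Temperature](
    [sensorId]    [int]      NOT NULL,
    [receiveDate] [datetime] NOT NULL,
    [value]       [real]     NOT NULL
)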

Pros:

  • store only the required amount of data (currently our binary Data blob reserves space for the longest value), reducing DB size;
  • to get data of a particular type we access the right table instead of filtering by type;

Cons:

  • AFAIK, it violates recommended practices;
  • requires framework development to automate table management, since maintaining it manually would be a DBA's hell;
  • the number of tables could be considerably large, since full coverage of the possible values is required;
  • the DB schema changes when new sensor data, or even a new state value for an already defined state, is introduced, and so can require complex changes;
  • complex management is error-prone;
  • it may be hell for the DB engine to insert values into such a table organisation?
  • the DB structure is not fixed (it changes constantly);

Probably all the cons outweigh the few pros, but if we gain significant performance and/or (less preferred but valuable too) storage space, maybe we would follow that path.

2 Maybe just split the table per sensor (that would be about 100,000 tables), or better by sensor range, and/or move to different databases with dedicated servers, but we want to avoid spanning hardware if possible.

3 Leave as it is.

4 Switch to a different kind of DBMS, e.g. a column-oriented DBMS (HBase and similar).

What do you think? Maybe you can suggest resources for further reading?

Update:
The nature of the system is that some data from sensors can arrive with up to a month's delay (usually 1-2 weeks), some sensors are always online, and some have on-board memory and come online eventually. Each sensor message has an associated event-raised date and a server-received date, so we can distinguish recent data from data gathered some time ago. The processing includes some statistical calculation, parameter deviation detection, etc. We build aggregated reports for quick viewing, but when data arriving from a sensor updates old (already processed) data, we have to rebuild some reports from scratch, since they depend on all available data and the aggregated values can't be reused. So we usually keep 3 months of data for quick access and archive the rest. We try hard to reduce the amount of data we need to store, but decided that we need it all to keep the results accurate.

Update2:

Here is the table with the primary data. As I mentioned in the comments, we removed all dependencies and constraints from it during the "need for speed" phase, so it is used for storage only.

CREATE TABLE [Messages](
    [id] [bigint] IDENTITY(1,1) NOT NULL,
    [sourceId] [int] NOT NULL,            -- sensor id
    [messageDate] [datetime] NOT NULL,    -- date the event was raised on the sensor
    [serverDate] [datetime] NOT NULL,     -- date the message was received by the server
    [messageTypeId] [smallint] NOT NULL,  -- partially decoded message type (~200 types)
    [data] [binary](16) NOT NULL          -- fixed-length payload, decoded on the application server
)

Sample data from one of the servers:

id      sourceId    messageDate          serverDate messageTypeId   data
1591363304  54  2010-11-20 04:45:36.813 2010-11-20 04:45:39.813 257 0x00000000000000D2ED6F42DDA2F24100

1588602646  195 2010-11-19 10:07:21.247 2010-11-19 10:08:05.993 258 0x02C4ADFB080000CFD6AC00FBFBFBFB4D

1588607651  195 2010-11-19 10:09:43.150 2010-11-19 10:09:43.150 258 0x02E4AD1B280000CCD2A9001B1B1B1B77


Comments (4)

月隐月明月朦胧 2024-10-15 18:29:08


Just going to throw some ideas out there, hope they are useful - they're some of the things I'd be considering/thinking about/researching.

Partitioning - you mention the table is partitioned by month. Is that partitioned manually yourself, or are you making use of the partitioning functionality available in Enterprise Edition? If manual, consider using the built-in partitioning functionality to partition your data out further, which should give you increased scalability/performance. The "Partitioned Tables and Indexes" article on MSDN by Kimberly Tripp is great - lots of great info in there, and I won't do it an injustice by paraphrasing! It's worth considering this over manually creating one table per sensor, which could be more difficult to maintain/implement and therefore adds complexity (simple = good). Of course, this applies only if you have Enterprise Edition.
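
As a rough, hedged sketch of what the built-in route could look like for the Messages table, assuming monthly boundaries on serverDate (the function/scheme/index names, boundary dates and single-filegroup mapping are illustrative; new boundaries would be added with ALTER PARTITION FUNCTION ... SPLIT RANGE as months roll over):

-- Illustrative only: monthly partitioning for the Messages table.
CREATE PARTITION FUNCTION pfMessagesByMonth (datetime)
AS RANGE RIGHT FOR VALUES ('2010-10-01', '2010-11-01', '2010-12-01');

CREATE PARTITION SCHEME psMessagesByMonth
AS PARTITION pfMessagesByMonth ALL TO ([PRIMARY]);  -- one filegroup to keep the sketch short

-- Placing the clustered index on the scheme partitions the table itself by serverDate.
CREATE CLUSTERED INDEX IX_Messages_serverDate
ON [Messages] (serverDate, id)
ON psMessagesByMonth (serverDate);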

Filtered Indexes - check out this MSDN article
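
Purely as an illustration, a filtered index could target a hot message type so that "type + sensor + date range" lookups for that type do not have to wade through the rest (messageTypeId 258 is just taken from the sample rows above):

-- Illustrative only: index a single frequently queried message type.
CREATE NONCLUSTERED INDEX IX_Messages_Type258
ON [Messages] (sourceId, serverDate)
WHERE messageTypeId = 258;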

There is of course the hardware element - goes without saying that a meaty server with oodles of RAM/fast disks etc will play a part.

凉月流沐 2024-10-15 18:29:08


One technique, not so much related to databases, is to switch to recording only changes in values -- with a minimum of n records per minute or so. So, for example, if sensor no. 1 is sending something like:

Id  Date              Value
-----------------------------
1 2010-10-12 11:15:00 100
1 2010-10-12 11:15:02 100
1 2010-10-12 11:15:03 100
1 2010-10-12 11:15:04 105

then only the first and last records would end up in the DB. To make sure that the sensor is "live", a minimum of 3 records per minute would be entered. This way the volume of data would be reduced.
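
A minimal sketch of that insert logic in T-SQL, assuming a hypothetical SensorReadings table and a ~20-second heartbeat so that a silent-but-alive sensor still produces about 3 rows per minute (all names and thresholds are illustrative):

-- Hypothetical change-only recording: store a reading when the value changed
-- or when enough time has passed since the last stored row.
CREATE TABLE dbo.SensorReadings(
    SensorId int      NOT NULL,
    ReadDate datetime NOT NULL,
    [Value]  int      NOT NULL
)
GO

CREATE PROCEDURE dbo.RecordReading
    @SensorId int,
    @ReadDate datetime,
    @Value    int
AS
BEGIN
    DECLARE @LastValue int, @LastDate datetime

    SELECT TOP (1) @LastValue = [Value], @LastDate = ReadDate
    FROM dbo.SensorReadings
    WHERE SensorId = @SensorId
    ORDER BY ReadDate DESC

    IF @LastDate IS NULL                                -- first reading ever
       OR @LastValue <> @Value                          -- value changed
       OR DATEDIFF(second, @LastDate, @ReadDate) >= 20  -- heartbeat row
    BEGIN
        INSERT INTO dbo.SensorReadings (SensorId, ReadDate, [Value])
        VALUES (@SensorId, @ReadDate, @Value)
    END
END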

Not sure if this helps, or if it would be feasible in your application -- just an idea.

EDIT

Is it possible to archive data based on the probability of access? Would it be correct to say that old data is less likely to be accessed than new data? If so, you may want to take a look at Bill Inmon's DW 2.0 Architecture for The Next Generation of Data Warehousing, where he discusses a model for moving data through different DW zones (Interactive, Integrated, Near-Line, Archival) based on the probability of access. Access times vary from very fast (Interactive zone) to very slow (Archival). Each zone has different hardware requirements. The objective is to prevent large amounts of data from clogging the DW.

梦魇绽荼蘼 2024-10-15 18:29:08


Storage-wise you are probably going to be fine. SQL Server will handle it.

What worries me is the load your server is going to take. If you are receiving transactions constantly, you would have some ~400 transactions per second today (roughly 35 million records per day spread over 86,400 seconds). Increase this by a factor of 20 and you are looking at ~8,000 transactions per second. That's not a small number considering you are doing reporting on the same data...

Btw, do I understand you correctly in that you are discarding the sensor data when you have processed it? So your total data set will be a "rolling" 1 billion rows? Or do you just append the data?

总攻大人 2024-10-15 18:29:08


You could store the datetime stamps as integers. I believe datetime stamps use 8 bytes and integers only use 4 within SQL. You'd have to leave off the year, but since you are partitioning by month it might not be a problem.

So '12/25/2010 23:22:59' would get stored as 1225232259 (MMDDHHMMSS).
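
A hedged sketch of that packing in T-SQL (the expression is illustrative; the largest possible result, 1231235959, still fits in a 4-byte int):

-- Illustrative only: pack a datetime (year dropped) into an int as MMDDHHMMSS.
DECLARE @d datetime = '2010-12-25 23:22:59'

SELECT CAST(
         MONTH(@d)            * 100000000
       + DAY(@d)              * 1000000
       + DATEPART(hour,   @d) * 10000
       + DATEPART(minute, @d) * 100
       + DATEPART(second, @d)
       AS int) AS PackedDate            -- 1225232259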

Just a thought...
