SQL Server 2008 design question for storing meteorological data
We're using SQL Server 2008 R2 Enterprise Edition.
We are measuring meteorological data from what we call MetMasts.
Basically this is a mast with lots of equipment: anemometers (for wind speed) at different positions on the mast, thermometers, and air pressure sensors.
We measure every second.
And it takes up tooooo much disk space.
The next generation of this equipment will generate over 10 GB per year each. And we’re going to have more than 1000 of these.
The current table design looks a bit like this:
CREATE TABLE #MetMast (
    MetMastID   INT NOT NULL IDENTITY(1,1),
    MetMastName NVARCHAR(100),
    CountryID   INT,
    InstallDate DATE
)
CREATE TABLE #MetMastData (
    MetMastDataID  BIGINT NOT NULL IDENTITY(1,1),
    MetMastID      INT NOT NULL,
    MeasuredAt     DATETIME2(0) NOT NULL,
    Temperature    REAL NULL,
    WindSpeedAt10m REAL NULL,
    WindSpeedAt30m REAL NULL,
    AirPressure    REAL NULL,
    OneHundredMoreColumns VARCHAR(200),
    CONSTRAINT PK_MetMastData PRIMARY KEY CLUSTERED
    (
        MetMastID ASC,
        MeasuredAt ASC
    )
)
WITH (DATA_COMPRESSION = ROW)
-- ON a file group, with table partitioning
ALTER TABLE #MetMastData WITH NOCHECK ADD CONSTRAINT FK_MetMast_MetMastID FOREIGN KEY (MetMastID) REFERENCES #MetMast (MetMastID)
The data is write once, read many, many times.
We use it in our data warehouse, where a typical question would be: per MetMast, count how many times there is a difference of more than 2 m/s between WindSpeedAt10m and WindSpeedAt30m while the temperature is above 20 degrees.
SELECT MetMastID, COUNT_BIG(*)
FROM #MetMastData
WHERE Temperature > 20 AND ABS(WindSpeedAt10m - WindSpeedAt30m) > 2
GROUP BY MetMastID
In the future, a tiny bit of data loss will be acceptable.
We’re talking lossy compression of data here. I know we will have to define an acceptable error for each of the fields, as in 1% if we measure with 10% accuracy.
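To make that concrete, something along these lines is roughly what I have in mind; the 0.05 m/s tolerance below is only an illustrative number, not one we have settled on:
-- Quantise each reading to a per-field tolerance before it is stored,
-- e.g. keep wind speed only to the nearest 0.05 m/s (tolerance chosen here for illustration).
DECLARE @Tolerance REAL = 0.05;
DECLARE @WindSpeedAt10m REAL = 7.4321;
SELECT ROUND(@WindSpeedAt10m / @Tolerance, 0) * @Tolerance AS QuantisedWindSpeed;  -- approximately 7.45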
Lossy compression worked for sound files (the size reduction with MP3 is quite big), so it might work for us as well.
But how is this done?
What table design should I go for?
How do I get started with lossy compression of data in database tables?
Best regards,
Henrik Staun Poulsen
2 Answers
For each of your data points, consider the accuracy you need to store.
REAL takes up four bytes for each row. If you could drop all decimal places for WindSpeed, you could probably do with a tinyint (1 byte, 0-255). Given that you most likely need some precision, you could use a smallint instead and multiply the actual value by 100:
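For example, a sketch of that scaled-integer idea (the sample value and variable names are illustrative, not from the poster's schema):
-- Store wind speed as hundredths of m/s in a 2-byte SMALLINT instead of a 4-byte REAL.
DECLARE @WindSpeedAt10m REAL = 12.34;
DECLARE @Stored SMALLINT = CAST(ROUND(@WindSpeedAt10m * 100, 0) AS SMALLINT);
SELECT @Stored AS StoredValue,                 -- 1234
       @Stored / 100.0 AS ReconstructedValue;  -- 12.34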
This would save you two bytes per row and preserve some precision, though with some loss at the limit. Since it seems you'll have quite a lot of these columns, a 2-byte saving per column would amount to quite a lot.
You've got an 8-byte bigint for your MetMastDataID. Is it necessary? Won't everything be queried by MetMastID and MeasuredAt? Dropping that will save you 8 bytes. It will however result in fragmentation since your clustered key will no longer be sequential, so defragmentation will be necessary. Since this sounds like an archival/OLAP system, that shouldn't be a big problem.
EDIT: I just realized you're not clustered on MetMastDataID, so fragmentation won't change from what it is now. The question then is: do you ever use MetMastDataID for anything?
Further, if you can avoid all variable-length columns, that'll save you 2 bytes of row overhead plus 2 bytes per variable-length column, per row, not including the actual variable-length data itself.
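Putting those suggestions together, a slimmed-down version of the fact table might look roughly like this (a sketch only; the table name, the scaled SMALLINT columns, and the dropped surrogate key are illustrative, not the poster's actual design):
-- Sketch: no BIGINT surrogate key, measurements stored as scaled SMALLINTs,
-- no variable-length columns, same clustered key and row compression as before.
CREATE TABLE #MetMastDataSlim (
    MetMastID      INT          NOT NULL,
    MeasuredAt     DATETIME2(0) NOT NULL,
    Temperature    SMALLINT     NULL,   -- hundredths of a degree
    WindSpeedAt10m SMALLINT     NULL,   -- hundredths of m/s
    WindSpeedAt30m SMALLINT     NULL,   -- hundredths of m/s
    AirPressure    SMALLINT     NULL,   -- tenths of hPa
    CONSTRAINT PK_MetMastDataSlim PRIMARY KEY CLUSTERED (MetMastID ASC, MeasuredAt ASC)
)
WITH (DATA_COMPRESSION = ROW)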
Lossy compression relies on the limits of human perception, i.e. differences the eye or ear cannot detect; MP3 and JPEG are examples. In your case that kind of lossy compression doesn't make sense, because you are working with numeric measurements, not audio/video data.
To implement lossless compression you can use a CLR function. An example is here: http://www.codeproject.com/KB/database/blob_compress.aspx