Your first instinct on this SqlServer design question

Posted 2024-09-27 21:19:56


We have 2 tables. One holds measurements, the other holds timestamps (one for every minute).
Every measurement holds an FK to a timestamp.
We have 8M (million) measurements and 2M timestamps.

We are creating a report database via replication, and my first solution was this: when a new measurement is received via the replication process, look up the right timestamp and add it to the measurement table.
Yes, it's duplication of data, but it is for reporting, and since we have measurements every 5 minutes and users can query for yearly data (105,000 measurements) we have to optimize for speed.

But a co-developer said: you don't have to do that, we'll just query with a join (on the two tables), SqlServer is so fast, you don't see the difference.

My first reaction was: a join on two tables with 8M and 2M records can't make 'no difference'.

What is your first feeling on this?

EDIT:
new measurements: 400 records per 5 minutes

EDIT 2:
maybe the question is not so clear:

The first solution is to get the data from the timestamp table and copy it to the measurement table when the measurement record is inserted.
In that case we have an action when the record is inserted AND an extra (duplicated) timestamp value. In this case we only query ONE table because it holds all the data.

The second solution is to join the two tables in a query.
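For concreteness, here is a minimal T-SQL sketch of the two options. All object and column names (dbo.MeasurementReport, dbo.Measurement, dbo.TimestampDim, MeasuredAt, Value) are made up, since the question doesn't show the real schema.

```sql
-- Hypothetical schema; the real table and column names are not given in the question.

-- Solution 1: denormalize on insert - the timestamp value is copied into the
-- measurement row, so reports only ever touch one table.
CREATE TABLE dbo.MeasurementReport (
    MeasurementId int            NOT NULL PRIMARY KEY,
    MeasuredAt    datetime       NOT NULL,  -- duplicated from the timestamp table
    Value         decimal(18, 4) NOT NULL
);

SELECT MeasuredAt, Value
FROM dbo.MeasurementReport
WHERE MeasuredAt >= '20240101' AND MeasuredAt < '20250101';

-- Solution 2: keep the FK and join at query time.
SELECT t.MeasuredAt, m.Value
FROM dbo.Measurement AS m
JOIN dbo.TimestampDim AS t ON t.TimestampId = m.TimestampId
WHERE t.MeasuredAt >= '20240101' AND t.MeasuredAt < '20250101';
```

The first version pays once per insert and stores the datetime twice; the second keeps the normalized design and pays (if anything) at query time.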


Comments (4)

因为看清所以看轻 2024-10-04 21:19:56


With the proper index the join will make no difference*. My initial thought is that if the report is querying over the entire dataset, the join might actually be faster because there are literally 6 million fewer timestamps that it has to read from disk.

*This is just a guess based on my experience with tables with millions of records. Your results will vary based on your queries.
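As a sketch of the kind of indexing this answer assumes (same hypothetical schema as in the question section above, not the answerer's exact indexes):

```sql
-- Indexes that keep the join and the yearly range predicate cheap.
CREATE UNIQUE NONCLUSTERED INDEX IX_TimestampDim_MeasuredAt
    ON dbo.TimestampDim (MeasuredAt);   -- range-seek the dates; the clustered key rides along

CREATE NONCLUSTERED INDEX IX_Measurement_TimestampId
    ON dbo.Measurement (TimestampId)
    INCLUDE (Value);                    -- covers the report query, so no key lookups
```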

一生独一 2024-10-04 21:19:56


I'd create an Indexed View (similar to a Materialized view in Oracle) which joins the tables using appropriate indexes.
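A rough sketch of that approach in T-SQL, again with hypothetical names; an indexed view needs SCHEMABINDING and two-part object names, and on editions below Enterprise a query has to reference the view WITH (NOEXPAND) for its index to be used:

```sql
-- The view definition must be schema-bound and use two-part names.
CREATE VIEW dbo.vMeasurementReport
WITH SCHEMABINDING
AS
SELECT m.MeasurementId, t.MeasuredAt, m.Value
FROM dbo.Measurement AS m
JOIN dbo.TimestampDim AS t ON t.TimestampId = m.TimestampId;
GO

-- The unique clustered index is what actually materializes the view on disk.
CREATE UNIQUE CLUSTERED INDEX IX_vMeasurementReport
    ON dbo.vMeasurementReport (MeasurementId);
```

Once the clustered index exists, the reporting query effectively reads a single pre-joined structure, which gets close to the denormalized table from the question without custom insert logic.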

高速公鹿 2024-10-04 21:19:56


If the query just retrieves the data for the given date ranges, there will be a merge join - that is, a range scan on each of the two tables. Since the timestamp table presumably contains only timestamps, this shouldn't be expensive.
On the other hand, if you have only one table and an index on the date column, the index itself becomes larger and more expensive to scan.

So, with properly constructed indexes and queries I wouldn't expect a significant difference in performance.
I'd suggest you keep the properly normalized design until you start having performance problems that force you to change it. Then you need to carefully analyze query plans and measure performance with different options - there are lots of things that could matter in your particular case.
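One way to check which join operator the optimizer actually picks for a given date range (a sketch, using the same hypothetical schema as above):

```sql
-- Capture the estimated plan as XML instead of running the query;
-- look for MergeJoin / HashMatch / NestedLoops in the plan output.
SET SHOWPLAN_XML ON;
GO
SELECT t.MeasuredAt, m.Value
FROM dbo.Measurement AS m
JOIN dbo.TimestampDim AS t ON t.TimestampId = m.TimestampId
WHERE t.MeasuredAt >= '20240101' AND t.MeasuredAt < '20240201';
GO
SET SHOWPLAN_XML OFF;
GO
```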

梦在深巷 2024-10-04 21:19:56


Frankly, in this case your best bet is to try both solutions and see which one is better. Performance tuning is an art when you start talking about large data sets, and it is highly dependent not only on the database design you have but also on the hardware and whether you are using partitioning, etc. Be sure to test both getting the data out and putting the data in. Since you have so many inserts, insert speed is critical, and the index you would need on the datetime field is critical to select performance, so you really need to test this thoroughly. Don't forget about dumping the cache when you test. And test multiple times and, if possible, test under a typical query load.
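A rough test-harness sketch along those lines (hypothetical schema again; the DBCC commands flush server-wide caches, so run this only on a dev/test server):

```sql
-- Cold-cache, recompiled-plan baseline for each candidate query.
CHECKPOINT;              -- write dirty pages so the buffer pool can be fully cleared
DBCC DROPCLEANBUFFERS;   -- cold buffer cache: timings include physical reads
DBCC FREEPROCCACHE;      -- throw away cached plans: forces recompilation

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run each candidate (the join version below, then the single-table version)
-- and compare logical reads, CPU time and elapsed time over several runs.
SELECT t.MeasuredAt, m.Value
FROM dbo.Measurement AS m
JOIN dbo.TimestampDim AS t ON t.TimestampId = m.TimestampId
WHERE t.MeasuredAt >= '20240101' AND t.MeasuredAt < '20250101';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```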
