Best strategy to solve search operation performance - SQL Server 2008
I'm working on a mobile website which is growing in popularity, and this is leading to growth in some key database tables - we're starting to see performance issues when accessing those tables. Not being database experts (nor having the money to hire any at this stage), we're struggling to understand what is causing the performance problems. Our tables are not that big, so SQL Server should be able to handle them fine, and we've done everything we know to do in terms of optimising our queries. So here's the (pseudo) table structure:
[user] (approx. 40,000 rows, 37 cols):
id INT (pk)
content_group_id INT (fk)
[username] VARCHAR(20)
...
[content_group] (approx. 200,000 rows, 5 cols):
id INT (pk)
title VARCHAR(20)
...
[content] (approx. 1,000,000 rows, 12 cols):
id INT (pk)
content_group_id INT (fk)
content_type_id INT (fk)
content_sub_type_id INT (fk)
...
[content_type] (2 rows, 3 cols):
id INT (pk)
...
[content_sub_type] (8 rows, 3 cols):
id INT (pk)
content_type_id INT (fk)
...
We're expecting those row counts to grow considerably (in particular the user, content_group, and content tables). Yes the user table has quite a few columns - and we've identified some which can be moved into other tables. There are also a bunch of indexes we've applied to the affected tables which have helped.
The big performance problems are the stored procedures we're using to search for users (which include joins to the content table on the content_group_id field). We have tried to modify the WHERE and AND clauses using various different approaches, and we think we have got them as good as we can, but it's still too slow.
One other thing we tried which hasn't helped was to put an indexed view over the user and content tables. There was no noticeable performance gain when we did this so we've abandoned that idea due to the extra level of complexity inherent in having a view layer.
So, what are our options? We can think of a few but all come with pros and cons:
Denormalisation of the Table Structure
Add multiple direct foreign key constraints between the user and content tables - so there would be a different foreign key to the content table for each content sub type.
Pros:
- Joins to the content table can use its primary key, which is more efficient.
Cons:
- There will be a lot of changes to our existing stored procedures and website code.
- Maintaining up to 8 additional foreign keys (more realistically we'll only use 2 of these) will not be anywhere near as easy as the current single key.
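This option would amount to something like the following (a sketch - the new column and constraint names are hypothetical, and we'd only add the two keys we actually use):

```sql
-- Hypothetical: one direct FK to content per sub type we actually use.
ALTER TABLE [user] ADD profile_content_id INT NULL;
ALTER TABLE [user] ADD banner_content_id INT NULL;

ALTER TABLE [user]
  ADD CONSTRAINT fk_user_profile_content
  FOREIGN KEY (profile_content_id) REFERENCES content (id);
ALTER TABLE [user]
  ADD CONSTRAINT fk_user_banner_content
  FOREIGN KEY (banner_content_id) REFERENCES content (id);

-- Searches could then join on the content primary key directly:
-- JOIN content AS c ON c.id = u.profile_content_id
```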
More Denormalisation of the Table Structure
Just duplicate the fields we need from the content table into the user table directly.
Pros:
- No more joins to the content table - which significantly reduces the work SQL Server has to do.
Cons:
- Same as above: extra fields to maintain in the user table, changes to SQL and website code.
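This option could be sketched like this (hypothetical column names; the application - or a trigger - would have to keep the copies in sync from then on):

```sql
-- Hypothetical: copy the searched content fields onto [user].
ALTER TABLE [user] ADD content_type_id INT NULL;
ALTER TABLE [user] ADD content_sub_type_id INT NULL;

-- One-off backfill from the existing content rows
-- (assumes one relevant content row per content group; adjust the join as needed).
UPDATE u
SET u.content_type_id = c.content_type_id,
    u.content_sub_type_id = c.content_sub_type_id
FROM [user] AS u
JOIN content AS c
  ON c.content_group_id = u.content_group_id;
```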
Create a Mid-Tier Indexing Layer
Using something like Lucene.NET, we'd put an indexing layer above the database. In theory this would improve the performance of all searches and at the same time decrease the load on the server.
Pros:
- This is a good long-term solution. Lucene exists to improve search engine performance.
Cons:
- There will be a much larger development cost in the short term - and we need to solve this problem ASAP.
So those are the things we've come up with, and at this stage we're thinking the second option is the best. I'm aware that denormalising has its issues; however, sometimes it's best to sacrifice architectural purity in order to get performance gains, so we're prepared to pay that cost.
Are there any other approaches which might work for us? Are there any additional pros and/or cons with the approaches I've outlined above which may influence our decisions?
1 Answer
This description would indicate that your expensive query filters the
content table based on fields from content_type. This table design, and the resulting problem you are seeing, is actually quite common. The problems arise mainly from the very low selectivity of the lookup tables (content_type has 2 rows, so the selectivity of content_type_id in content is probably 50% - huge). There are several solutions you can try:

1) Organize the content table on a clustered index with content_type_id as the leading key. This would allow the join to do range scans and also avoid the key/bookmark lookups needed to complete the projection. As a clustered index change, it would have implications on other queries, so it has to be carefully tested. The primary key on content would obviously have to be enforced with a non-clustered constraint.

2) Pre-read the content_type_id value and then formulate the query without the join between content and content_type. This works only if the selectivity of content_type_id is high (many distinct values with few rows each), which I doubt is your case (you probably have very few content types, with many entries each).
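Solutions 1) and 2) might look roughly like this (a sketch using the names from the pseudo-structure; the constraint name, the `name` column on content_type, and @typeName are illustrative assumptions):

```sql
-- 1) Re-cluster content with content_type_id as the leading key,
--    keeping the primary key as a non-clustered constraint.
ALTER TABLE content DROP CONSTRAINT pk_content;  -- hypothetical constraint name
CREATE CLUSTERED INDEX cdx_content
    ON content (content_type_id, id);
ALTER TABLE content
    ADD CONSTRAINT pk_content PRIMARY KEY NONCLUSTERED (id);

-- 2) Pre-read the type id, then query content without the join:
DECLARE @typeId INT;
SELECT @typeId = id
FROM content_type
WHERE name = @typeName;  -- hypothetical lookup column

SELECT c.id, c.content_group_id
FROM content AS c
WHERE c.content_type_id = @typeId;
```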
3) Denormalize content_type into content. You mention denormalization, but your proposal of denormalizing content into users makes little sense to me. Drop the content_type table, pull the content_type fields into the content table itself, and live with all the denormalization problems.

4) Pre-join in a materialized view. You say you already tried that, but I doubt that you tried the right materialized view. You also need to understand that only Enterprise Edition uses the materialized view's index automatically; all other editions require the NOEXPAND hint:
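Such a pre-joined indexed view might be sketched like this (the view name and column list are illustrative; on non-Enterprise editions the query must reference it WITH (NOEXPAND)):

```sql
-- Hypothetical indexed view pre-joining user and content.
CREATE VIEW dbo.v_user_content
WITH SCHEMABINDING
AS
SELECT u.id AS user_id, u.username,
       c.id AS content_id, c.content_type_id
FROM dbo.[user] AS u
JOIN dbo.content AS c
  ON c.content_group_id = u.content_group_id;
GO
CREATE UNIQUE CLUSTERED INDEX cdx_v_user_content
    ON dbo.v_user_content (user_id, content_id);
GO
-- Non-Enterprise editions must ask for the view's index explicitly:
SELECT user_id, username
FROM dbo.v_user_content WITH (NOEXPAND)
WHERE content_type_id = 1;
```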
Solutions 2), 3) and 4) are mostly academic. Given the very low selectivity of content_type_id, the only solution that has a fighting chance is to make it the leading key in the clustered index of content. I did not extend the analysis to content_sub_type, but with only 8 rows I'm willing to bet it has the very same problem, which would require pushing it into the clustered index as well (perhaps as the second key).