T-SQL query and index question
I have a table that stores photo information, with id as the primary key:
id (PK), title, album_id, posted_by, published, filename, tags, ratings, date_posted
This table will hold information on 100+ million photos, and I need to run queries like these frequently:
1) Get all photos (just the id, filename, title columns) of a given album:
    select id, filename, title
    from photos
    where album_id = @AlbumId and published = 1
2) Get all published photos of a given user, but exclude photos of the album currently being viewed:
    select id, filename, title
    from photos
    where posted_by = 'bob' and album_id <> 10 and published = 1
I want to avoid index and table scans. I need to use seeks (say 100%) as much as possible.
Can this be done? What type of index, and on which columns, would help me achieve this?
Thanks
In reality, you'll only be able to find this out yourself by measuring performance before you tweak, then tweak, and measure again and again.
But based on your queries, you should consider (or at least try first) a non-clustered index on album_id, published, posted_by that includes the id, filename, title columns.
Reasoning:
- both queries use album_id and published in the WHERE clause, so use these two columns first in your index
- the second query also uses posted_by in the WHERE clause, so put that into the same index as the third column
- both queries return id, filename, title, so include those columns in the index
With all these things in place, you should be seeing mostly index seeks on that new non-clustered index to satisfy your queries. But again: lots of other factors also come into play which you probably haven't mentioned in your question and possibly not even thought about yourself. Still, this approach should give you a good starting point.
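The index this answer describes could be written roughly as follows. This is a minimal sketch, not the answerer's exact statement: the index name is hypothetical, and the table is assumed to be dbo.photos.

```sql
-- Sketch only: IX_photos_album_published_poster is a made-up name,
-- and the dbo schema is an assumption.
-- Key columns follow the reasoning above: the two columns filtered by
-- both queries first, then posted_by as the third key column.
CREATE NONCLUSTERED INDEX IX_photos_album_published_poster
    ON dbo.photos (album_id, published, posted_by)
    -- The selected columns are carried at the leaf level so the queries
    -- can be answered from the index without going back to the table.
    INCLUDE (id, filename, title);
```

Comparing the actual execution plans of both queries before and after creating the index is the way to confirm whether they are satisfied by seeks.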
You didn't mention if there's a need to use the date_posted or the id as filter criteria in the query, so it might be best to use a CLUSTERED index on a non-chronological column (I'm assuming that the current CLUSTERED index is the PK. Right?).
I would create a CLUSTERED index on the album_id.
If you can't change the CLUSTERED index or there are many other queries that benefit from the existing clustered index, then I support the answer from @marc_s (and will vote accordingly.)
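Moving the clustered index to album_id might look like the sketch below. It assumes the primary key is currently the clustered index and that its constraint is named PK_photos. Both the constraint name and the index name are hypothetical.

```sql
-- Assumption: the existing clustered primary key constraint is called PK_photos.
-- A table can have only one clustered index, so the clustered PK is dropped first.
ALTER TABLE dbo.photos DROP CONSTRAINT PK_photos;

-- Cluster the table on album_id (not unique, since an album holds many photos).
CREATE CLUSTERED INDEX CIX_photos_album_id
    ON dbo.photos (album_id);

-- Re-create the primary key on id as a non-clustered index.
ALTER TABLE dbo.photos
    ADD CONSTRAINT PK_photos PRIMARY KEY NONCLUSTERED (id);
```

On a table with 100+ million rows, each of these steps rewrites or rebuilds a large structure, so it is something to plan for rather than run casually.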
I would suggest a clustered index on album_id and a secondary index on posted_by, if the former is the one that will get hit most. Invert them if posted_by is hit most. Depending on how many photos there are for each album_id or posted_by, it may be quite feasible to filter on published in the calling code (in other words, don't add it as a restriction in the query, rather filter client-side). If not, you'll have to add that published constraint into the query, but the primary restriction of album_id should mean that only a small scan on published is incurred. But as stated, it may be easier just to filter on published client-side.
Primary key on Id. Make it non-clustered. I'd guess this won't be used much (particularly if all lookups are by album or poster).
Clustered index on AlbumId. Seems like it'd be used in most queries.
Non-clustered index on Posted_By. With AlbumId the clustered index, it will appear at the leaf-level of this index, and so act pretty much like an INCLUDEd column. Depending on usage it might be better to have this as the clustered index... but as a varchar(20), it'd take up more disk space, and performance would be worse than AlbumId (assuming AlbumId is an int).
An index on Published by itself is unlikely to help. Bit columns can be indexed in current versions of SQL Server, but with only two possible values across 100M+ rows, SQL Server would probably never use such an index on its own for optimizing queries.
I'd recommend normalizing Posted_By (move it to its own table, give it its own surrogate key, and use that as a foreign key in this table). This would significantly reduce storage space in your main table, increase overall performance, and allow you to flip the clustered index to that column if need be. (Also, if "Bob" posts to the table, and then "Bob" from across town also posts, how do you tell Bob from Bob?)
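The normalization suggested here could be sketched as below. Every name in it (dbo.posters, poster_id, poster_name, the constraint and index names) is hypothetical, and the varchar(20) type is taken from this answer's own assumption about Posted_By.

```sql
-- Hypothetical lookup table: one row per poster, with a surrogate int key.
CREATE TABLE dbo.posters (
    poster_id   int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_posters PRIMARY KEY,
    poster_name varchar(20) NOT NULL   -- two different "Bob"s get two different poster_id values
);

-- Add the foreign key column to the photo table (nullable until it is backfilled).
ALTER TABLE dbo.photos
    ADD poster_id int NULL
        CONSTRAINT FK_photos_posters FOREIGN KEY REFERENCES dbo.posters (poster_id);

-- Non-clustered index on the narrow int key instead of the varchar name.
CREATE NONCLUSTERED INDEX IX_photos_poster_id
    ON dbo.photos (poster_id);
```

The sketch leaves out backfilling poster_id from the existing posted_by values and dropping the old column afterwards. How to do that depends on how the existing names map to distinct users.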