Is this a bad indexing strategy for the table?
The table in question is part of a database used by a vendor's software on our network. The table contains metadata about files. The table's schema is as follows:
Metadata
ResultID (PK, int, not null)
MappedFieldname (char(50), not null)
Fieldname (PK, char(50), not null)
Fieldvalue (text, null)
There is a clustered index on ResultID and Fieldname. The table typically contains millions of rows (in one case, 500 million). It is populated by 24 workers, each running 4 threads, while data is being "processed". This results in many non-sequential inserts. Later, after processing, more data is inserted into the table by some of our in-house software. Fragmentation for a given table is at least 50%; for the largest table, it is 90%. We do not have a DBA, and I am aware we desperately need a DB maintenance strategy. As for my background, I'm a college student working part time at this company.
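For reference, fragmentation like this can be measured directly (assuming SQL Server, which the schema and the answers below suggest; the table name comes from the schema above, everything else is standard system catalog):

```sql
-- Report fragmentation and size for every index on dbo.Metadata
SELECT i.name AS index_name,
       ips.index_type_desc,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Metadata'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id  = ips.index_id;
```

The 'LIMITED' scan mode only reads the B-tree above the leaf level, so it is cheap enough to run against a 500-million-row table.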
My question is this: is a clustered index the best way to go here? Should another index be considered? Are there any good references for this and similar ad-hoc DBA tasks?
Comments (4)
The indexing strategy entirely depends on how you query the table and how much performance you need to get out of the respective queries.
A clustered index can force rows to be physically re-sorted (on disk) when out-of-sequence inserts are made (this causes a "page split"). In a large table with no free space on the index pages, this can take some time.
If you are not absolutely required to have a clustered index spanning two fields, then don't. If it is more like a kind of a UNIQUE constraint, then by all means make it a UNIQUE constraint. No re-sorting is required for those.
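A sketch of that suggestion, assuming SQL Server and that nothing else (foreign keys, replication) depends on the current clustered PK; the constraint name is an assumption, so check yours first and test on a copy:

```sql
-- Make the two-column PK nonclustered, so out-of-order inserts no longer
-- force page splits in the clustered index.
ALTER TABLE dbo.Metadata DROP CONSTRAINT PK_Metadata;  -- hypothetical constraint name

ALTER TABLE dbo.Metadata
    ADD CONSTRAINT PK_Metadata
    PRIMARY KEY NONCLUSTERED (ResultID, Fieldname);
```

Without any clustered index the table becomes a heap, which has its own trade-offs (forwarded records, harder defragmentation), so this is something to benchmark, not apply blindly.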
Determine what the typical query against the table is, and place indexes accordingly. The more indexes you have, the slower data changes (INSERTs/UPDATEs/DELETEs) will go. Don't create too many indexes, e.g. on fields that are unlikely to be filtered/sorted on.
Typically, create combined indexes only on fields that are filtered/sorted on together.
Look hard at your queries - the ones that hit the table for data. Will the index serve? If you have an index on (ResultID, FieldName) in that order, but you are querying for the possible ResultID values for a given Fieldname, it is likely that the DBMS will ignore the index. By contrast, if you have an index on (FieldName, ResultID), it will probably use the index - certainly for simple value lookups (WHERE FieldName = 'abc'). In terms of uniqueness, either index works well; in terms of query optimization, there is (at least potentially) a huge difference. Use EXPLAIN to see how your queries are being handled by your DBMS.
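The reversed index described above would look like this in T-SQL (the index name is illustrative):

```sql
-- Supports seeks for WHERE FieldName = '...' predicates, which the
-- (ResultID, Fieldname) clustered index cannot satisfy with a seek.
CREATE NONCLUSTERED INDEX IX_Metadata_Fieldname_ResultID
    ON dbo.Metadata (Fieldname, ResultID);
```

On SQL Server specifically, the equivalent of EXPLAIN is the graphical execution plan or SET SHOWPLAN_XML ON.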
Clustered vs non-clustered indexing is usually a second-order optimization effect in the DBMS. If you have the index correct, there is a small difference between clustered and non-clustered index (with a bigger update penalty for a clustered index as compensation for slightly smaller select times). Make sure everything else is optimized before worrying about the second-order effects.
The clustered index is OK as far as I see. Regarding other indexes you will need to provide typical SQL queries that operate on this table. Just creating an index out of the blue is never a good idea.
You're talking about fragmentation and indexing - does that mean you suspect query execution has slowed down? Or do you simply want to shrink/defragment the database/indexes?
It is a good idea to have a task that defragments the indexes from time to time during off-hours. With frequent/random inserts, though, it does not hurt to leave some spare space in the table pages to prevent page splits (which do affect performance).
+1 for identifying that need
Keep studying and gaining experience, but get an experienced consultant in in the meantime.
I presume this is pretty mission-critical during the working day, and downtime is bad news? If so, don't mess with it.
Is ResultID the first column in the PK, as you indicate?
If so, I'll bet that it is insufficiently selective and, depending on what the needs of the queries are, the order of the PK fields should be swapped (notwithstanding that this compound key looks to be a poor choice for the clustered PK anyway).
What's the result of:
SELECT COUNT(*), COUNT(DISTINCT ResultID) FROM MyTable
If the first count is, say, 4x as big as the second, or more, you will most likely be getting scans in preference to seeks because of the low selectivity of ResultID, and some simple changes will give huge performance improvements.
Also, Fieldname is quite wide (50 chars) so any secondary indexes will have 50 + 4 bytes added to every index entry. Are the fields really CHAR rather than VARCHAR?
Personally, I would consider increasing the density of the leaf pages. At 90% you will only leave a few gaps - maybe one per page. But with a large table of 500 million rows, the higher packing density may mean fewer levels in the tree, and thus fewer seeks for retrieval. Against that, almost every insert into a given page will require a page split. This would favour inserts that are clustered, so it may not be appropriate (given that your insert data is probably not clustered). Like many things, you'd need to test to establish what index key density works best. SQL Server has tools to help analyse how queries are being parsed, whether they are being cached, how many scans of the table they cause, which queries are "slow running", and so on.
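Testing a different leaf-page density is just a rebuild with an explicit fill factor; the value below is something to experiment with, not a recommendation:

```sql
-- Lower fill factor leaves room for random inserts (fewer page splits);
-- higher fill factor packs pages densely (possibly a shallower tree).
-- Measure both against your real insert and query workload.
ALTER INDEX ALL ON dbo.Metadata
    REBUILD WITH (FILLFACTOR = 80);
```

Note that a rebuild of a 500-million-row table is a heavyweight operation in itself; on Enterprise Edition, adding ONLINE = ON keeps the table available during it.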
Get a consultant in to take a look and give you some advice. This isn't a question where the answers here will give you a safe solution to implement.
You really REALLY need to have some carefully-thought-through maintenance policies for tables that have 500 million rows and shed-loads of inserts daily. Sorry, but I have enormous frustration with companies that get into this state.
The table needs defragmenting (your options will become fewer if you don't have a clustered index, so keep that until you decide that there is a better candidate). "Online" defragmentation methods will have modest impact on performance, and can chug away - and can safely be aborted if they overrun time / CPU constraints [although that will most likely take some programming]. If you have a "quiet" slot then use it for table defragmentation and updating the statistics on indexes. Don't wait until the weekend to try to do all tables in one go - do as much/many as you can during any quiet time daily (during the night presumably).
Defragmenting the tables is likely to lead to a huge increase in transaction log usage, so make sure that the TLogs are backed up frequently (we have a 10-minute TLog backup policy, which we increase to every minute during table defragging so that the defragging process doesn't become the definition of required TLog space!)
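A minimal sketch of the kind of daily off-hours job described above, assuming SQL Server (database and backup path are illustrative; REORGANIZE is the "online" method mentioned, is always resumable, and can safely be stopped mid-run):

```sql
-- Online-ish defragmentation: reorganize compacts and reorders leaf pages
-- without long-held locks, and loses no work if aborted.
ALTER INDEX ALL ON dbo.Metadata REORGANIZE;

-- Refresh optimizer statistics after shuffling that much data around
UPDATE STATISTICS dbo.Metadata;

-- Keep the transaction log in check while the defrag runs
BACKUP LOG MyDatabase TO DISK = N'D:\Backups\MyDatabase_log.trn';  -- hypothetical path
```

A common rule of thumb (not from this thread) is REORGANIZE between roughly 5% and 30% fragmentation and REBUILD above that, but REBUILD is far heavier on the transaction log, which is exactly the concern raised above.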