CHECKSUM() collisions in SQL Server 2005
I've got a table of 5,651,744 rows with a primary key made of 6 columns (int x 3, smallint, varchar(39), varchar(2)). I am looking to improve performance with this table and with another table that shares this primary key plus one additional column but has 37 million rows.
In anticipation of adding a column to hold a hash key, I ran an analysis and found 18,733 collisions.
-- Rows whose hash value is shared with at least one other row
SELECT SUM(CT)
FROM (
    SELECT HASH_KEY
          ,COUNT(*) AS CT
    FROM (
        SELECT CHECKSUM(DATA_DT_ID, BANK_NUM, COST_CTR_NUM,
                        GL_ACCT_NUM, ACCT_NUM, APPN_CD) AS HASH_KEY
        FROM CUST_ACCT_PRFTBLT
    ) AS X
    GROUP BY HASH_KEY
    HAVING COUNT(*) > 1
) AS Y;

-- Total row count, for comparison
SELECT COUNT(*)
FROM CUST_ACCT_PRFTBLT;
It's about twice as bad with BINARY_CHECKSUM().
Does this seem too high (0.33%), given how small a fraction of the destination space I'm covering? And if collisions are this frequent, is there any benefit in joining on this manufactured key first, at the cost of an extra 4 bytes per row, given that you still have to join on the regular columns to handle the occasional collision?
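As a rough yardstick for the 0.33% figure, the birthday approximation estimates how many colliding pairs an ideal, uniformly distributed 32-bit hash would produce for this row count. This is only a sketch: it assumes perfectly uniform hashing, which CHECKSUM does not promise, and the variable name `@n` is purely illustrative.

```sql
-- Birthday approximation: n rows hashed uniformly into 2^32 values are
-- expected to produce about n * (n - 1) / (2 * 2^32) colliding pairs.
DECLARE @n float;
SET @n = 5651744;  -- row count from the question

SELECT @n * (@n - 1) / (2 * POWER(CAST(2 AS float), 32))
       AS expected_colliding_pairs;
```

This works out to roughly 3,700 pairs, or on the order of 7,400 rows sharing a hash value. If SUM(CT) is read as the number of rows involved in collisions, the observed 18,733 is about two to three times the ideal figure rather than orders of magnitude worse.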
Comments (5)
I don't see where adding a checksum will get you anything with that level of collisions. Even 1 collision is too many, as it would cause you to join to the wrong data. If you can't guarantee that you are joining to the correct record, a change that improves performance but compromises data integrity is pointless. This appears to be financial data, so you had better be really sure that your queries won't return bad results. You could actually end up debiting or crediting the wrong accounts if there are any collisions.
If you do go this route, Marc is right that you should pre-compute if at all possible (in my experience, adding a computation that has to happen for every record in multimillion-record tables is not likely to improve performance). If you can use a precomputed column (and you'll need triggers to keep it up to date), then you may not need to join on all six of the other columns to ensure no collisions, and in that case you might see improved performance. All you can do is test your theory. But be very sure you don't have any collisions.
Have you considered using a surrogate key and then a unique index on the six natural key fields instead? Then you could join on the surrogate key, which would likely improve performance a good bit. It can't be efficient to join on six columns (one a varchar) instead of one surrogate key. I realize that, given the size of the data, this might be harder to refactor than in a non-production system, but it might really be worth the downtime to permanently fix persistent performance problems. Only you can say how complex a change this would be and how hard it would be to change all the stored procedures or queries to a better join. However, it might be feasible to try.
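The surrogate-key idea could be sketched roughly as follows. The column name CUST_ACCT_PRFTBLT_ID, the index name, and the second table's name CUST_ACCT_PRFTBLT_DTL are all assumptions made for illustration; the thread never names them. The sketch also assumes the existing six-column PRIMARY KEY stays in place, so it keeps doing the job of the unique index on the natural key.

```sql
-- Hypothetical surrogate key; the six-column PRIMARY KEY is left as is
-- and continues to enforce uniqueness of the natural key.
ALTER TABLE CUST_ACCT_PRFTBLT
    ADD CUST_ACCT_PRFTBLT_ID int IDENTITY(1, 1) NOT NULL;
GO

-- Unique index so the surrogate can serve as a join target.
CREATE UNIQUE NONCLUSTERED INDEX UX_CUST_ACCT_PRFTBLT_ID
    ON CUST_ACCT_PRFTBLT (CUST_ACCT_PRFTBLT_ID);
GO

-- Carry the surrogate into the larger table (assumed name) ...
ALTER TABLE CUST_ACCT_PRFTBLT_DTL
    ADD CUST_ACCT_PRFTBLT_ID int NULL;
GO

-- ... and back-fill it once through the natural key.
UPDATE d
SET    d.CUST_ACCT_PRFTBLT_ID = p.CUST_ACCT_PRFTBLT_ID
FROM   CUST_ACCT_PRFTBLT_DTL AS d
JOIN   CUST_ACCT_PRFTBLT     AS p
  ON   p.DATA_DT_ID   = d.DATA_DT_ID
 AND   p.BANK_NUM     = d.BANK_NUM
 AND   p.COST_CTR_NUM = d.COST_CTR_NUM
 AND   p.GL_ACCT_NUM  = d.GL_ACCT_NUM
 AND   p.ACCT_NUM     = d.ACCT_NUM
 AND   p.APPN_CD      = d.APPN_CD;
GO

-- Later joins compare a single int column instead of six.
SELECT COUNT(*)
FROM   CUST_ACCT_PRFTBLT_DTL AS d
JOIN   CUST_ACCT_PRFTBLT     AS p
  ON   p.CUST_ACCT_PRFTBLT_ID = d.CUST_ACCT_PRFTBLT_ID;
```

Unlike a hash, an IDENTITY value cannot collide, so no residual comparison on the natural key is needed. The cost is the one-off back-fill and keeping the new column populated on insert.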
What I've seen a lot of folks gloss over thus far is that CHECKSUM has a ton of collisions, by Microsoft's own admission. It's even worse than MD5, which has its fair share of meaningful collisions.
If you're looking to get a hash column, consider using HASHBYTES with SHA1 specified. SHA1 has far fewer meaningful collisions than MD5 or CHECKSUM. CHECKSUM should therefore never be used to determine whether a row is unique; it is better suited to a quick check on the fidelity of two values. Your collision rate should be 0% with HASHBYTES, unless you have duplicate rows (which, being a PK, should never happen).
Keep in mind that HASHBYTES will truncate anything larger than 8000 bytes, but your PK is a lot less than that (all concatenated), so you shouldn't have any trouble.
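To test the HASHBYTES suggestion against real data, the collision check from the question could be adapted along these lines. This is a sketch under assumptions: the `'|'` delimiter is assumed not to occur in the character columns, and every key column is cast to varchar so the concatenation works whichever columns are numeric.

```sql
-- Count SHA1 hash values that are shared by more than one row.
SELECT COUNT(*) AS hash_values_shared_by_multiple_rows
FROM (
    SELECT HASH_KEY
    FROM (
        SELECT CAST(HASHBYTES('SHA1',
                   CAST(DATA_DT_ID   AS varchar(11)) + '|' +
                   CAST(BANK_NUM     AS varchar(11)) + '|' +
                   CAST(COST_CTR_NUM AS varchar(11)) + '|' +
                   CAST(GL_ACCT_NUM  AS varchar(11)) + '|' +
                   CAST(ACCT_NUM     AS varchar(39)) + '|' +
                   CAST(APPN_CD      AS varchar(2))
               ) AS binary(20)) AS HASH_KEY   -- SHA1 output is 20 bytes
        FROM CUST_ACCT_PRFTBLT
    ) AS X
    GROUP BY HASH_KEY
    HAVING COUNT(*) > 1
) AS Y;
```

The delimiter matters because plain concatenation is ambiguous: without it, the values ('12', '3') and ('1', '23') would feed identical input to the hash. Note also that a SHA1 key costs 20 bytes per row rather than the 4 bytes of a CHECKSUM int.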
If your checksum gets it down to 0.33% of the data, then I'd argue that it is working fine... especially if you use this column in combination with other (indexed) columns.
Of course, to be effective as an index you probably want to compute and store this value when inserting/updating data, with a non-clustered index.
Of course, a regular spanning index over the columns in question may do just as well or better...
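One way to compute and store the value at insert/update time is a persisted computed column with its own non-clustered index. The sketch below assumes the column name HASH_KEY from the question's query and a hypothetical index name; it also assumes the session SET options that indexes on computed columns require (the defaults in Management Studio normally satisfy them).

```sql
-- Persisted computed column: the engine stores the checksum and
-- recalculates it whenever any of the six key columns changes.
ALTER TABLE CUST_ACCT_PRFTBLT
    ADD HASH_KEY AS CHECKSUM(DATA_DT_ID, BANK_NUM, COST_CTR_NUM,
                             GL_ACCT_NUM, ACCT_NUM, APPN_CD) PERSISTED;
GO

-- Non-clustered index on the stored hash (index name is an assumption).
CREATE NONCLUSTERED INDEX IX_CUST_ACCT_PRFTBLT_HASH_KEY
    ON CUST_ACCT_PRFTBLT (HASH_KEY);
GO
```

Because the engine maintains the column itself, this form should not need a trigger to keep the hash current.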
If your queries are selective and the line table clustered index is narrow or non-existent, then a non-clustered index on checksum in the line table should provide good performance.
After applying whatever criteria are present to the header table, it will use the checksum to perform an index seek on the non-clustered index. You still need to include the FKs in the join, but the non-checksum join criteria will be applied post-index seek, post-bookmark lookup. Very efficient.
You want to optimize for the index seek. The checksum is already highly selective. Adding the FKs would increase the index size and corresponding I/O, and wouldn't help unless it included enough other fields to avoid the bookmark lookup altogether.
Since the non-clustered index will contain the clustering keys or heap pointer, you want either a) a small clustering key (e.g., an int identity column--4-byte pointer) or b) no clustered index at all (8-byte pointer).
If your queries are not selective, or if the line table clustered index is huge (the entire table minus a few columns) then I don't know if the checksum would help (faster index navigation, perhaps?). In any case you would want to make it a clustered or covering index, and if the header table isn't clustered on the checksum first, there will be much sorting.
If you can afford the storage and indexing costs, a few covering indexes--header and detail--may be the way to go.
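The join shape described here, with the checksum driving the seek and the remaining key columns rejecting collisions, might look like the sketch below. It assumes both tables already carry a HASH_KEY column, treats CUST_ACCT_PRFTBLT as the header table and a hypothetically named CUST_ACCT_PRFTBLT_DTL as the line table, and uses a made-up filter value.

```sql
SELECT COUNT(*)
FROM   CUST_ACCT_PRFTBLT     AS h
JOIN   CUST_ACCT_PRFTBLT_DTL AS l
  ON   l.HASH_KEY     = h.HASH_KEY       -- candidate for the index seek
 AND   l.DATA_DT_ID   = h.DATA_DT_ID     -- remaining predicates reject
 AND   l.BANK_NUM     = h.BANK_NUM       -- rows that merely collide on
 AND   l.COST_CTR_NUM = h.COST_CTR_NUM   -- the hash value
 AND   l.GL_ACCT_NUM  = h.GL_ACCT_NUM
 AND   l.ACCT_NUM     = h.ACCT_NUM
 AND   l.APPN_CD      = h.APPN_CD
WHERE  h.DATA_DT_ID = 20090101;          -- hypothetical selective filter
```

Whether the optimizer actually seeks on the hash index for a query like this depends on the available indexes and statistics, so the actual execution plan is worth checking.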
If your PRIMARY KEY is clustered, then each index you create will contain this PRIMARY KEY.
Joining on a hashed value will use the following steps:
1. Locate the hashed value in the index key
2. Locate the PRIMARY KEY value in the index data
3. Use Clustered Index Seek to locate the PRIMARY KEY row in the table
Joining on a PRIMARY KEY will use only step 3.
SQL Server, however, is smart enough to take this into account: if you join on HASH_KEY together with the PRIMARY KEY columns, it will not use the index on HASH_KEY. Instead, it will use a single Clustered Index Seek and a Filter to make sure the hash values match (and they always will).
Summary: just join on the PRIMARY KEY.
Using a secondary index, you'll first need to do a useless HASH_KEY search, and then still need to join on the PRIMARY KEY.
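To check this claim on a given system, the two access paths can be compared by reading the I/O statistics for the optimizer's own choice and for a plan forced through the hash index. This is a sketch with assumed names: the detail table CUST_ACCT_PRFTBLT_DTL, a HASH_KEY column on both tables, and the index IX_CUST_ACCT_PRFTBLT_HASH_KEY are all hypothetical.

```sql
SET STATISTICS IO ON;
GO

-- Variant A: join on the PRIMARY KEY columns only; the optimizer
-- is free to choose the clustered index.
SELECT COUNT(*)
FROM   CUST_ACCT_PRFTBLT_DTL AS d
JOIN   CUST_ACCT_PRFTBLT     AS p
  ON   p.DATA_DT_ID   = d.DATA_DT_ID
 AND   p.BANK_NUM     = d.BANK_NUM
 AND   p.COST_CTR_NUM = d.COST_CTR_NUM
 AND   p.GL_ACCT_NUM  = d.GL_ACCT_NUM
 AND   p.ACCT_NUM     = d.ACCT_NUM
 AND   p.APPN_CD      = d.APPN_CD;
GO

-- Variant B: same join plus HASH_KEY, with a table hint that forces
-- the non-clustered hash index (index name is an assumption).
SELECT COUNT(*)
FROM   CUST_ACCT_PRFTBLT_DTL AS d
JOIN   CUST_ACCT_PRFTBLT     AS p
       WITH (INDEX(IX_CUST_ACCT_PRFTBLT_HASH_KEY))
  ON   p.HASH_KEY     = d.HASH_KEY
 AND   p.DATA_DT_ID   = d.DATA_DT_ID
 AND   p.BANK_NUM     = d.BANK_NUM
 AND   p.COST_CTR_NUM = d.COST_CTR_NUM
 AND   p.GL_ACCT_NUM  = d.GL_ACCT_NUM
 AND   p.ACCT_NUM     = d.ACCT_NUM
 AND   p.APPN_CD      = d.APPN_CD;
GO

SET STATISTICS IO OFF;
GO
```

Comparing the logical reads reported for each variant gives a concrete basis for deciding whether the extra hash column pays for itself.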