Is there a REAL performance difference between INT and VARCHAR primary keys?
Is there a measurable performance difference between using INT vs. VARCHAR as a primary key in MySQL? I'd like to use VARCHAR as the primary key for reference lists (think US States, Country Codes) and a coworker won't budge on the INT AUTO_INCREMENT as a primary key for all tables.
My argument, as detailed here, is that the performance difference between INT and VARCHAR is negligible: every INT foreign key reference requires a JOIN to make sense of the reference, whereas a VARCHAR key presents the information directly.
So, does anyone have experience with this particular use-case and the performance concerns associated with it?
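To make the argument concrete, here is a minimal sketch of the two designs being compared. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table and column names (state_s, address_s, state_n, address_n) are hypothetical, chosen only for illustration. It shows the shape of the queries, not their speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in; the question is about MySQL
conn.executescript("""
    -- Surrogate-key design: addresses point at an integer id.
    CREATE TABLE state_s (id INTEGER PRIMARY KEY, code TEXT UNIQUE, name TEXT);
    CREATE TABLE address_s (id INTEGER PRIMARY KEY, city TEXT,
                            state_id INTEGER REFERENCES state_s(id));
    -- Natural-key design: addresses carry the state code itself.
    CREATE TABLE state_n (code TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE address_n (id INTEGER PRIMARY KEY, city TEXT,
                            state_code TEXT REFERENCES state_n(code));
    INSERT INTO state_s VALUES (1, 'CA', 'California'), (2, 'NY', 'New York');
    INSERT INTO address_s VALUES (1, 'Fresno', 1), (2, 'Albany', 2);
    INSERT INTO state_n VALUES ('CA', 'California'), ('NY', 'New York');
    INSERT INTO address_n VALUES (1, 'Fresno', 'CA'), (2, 'Albany', 'NY');
""")

# With the surrogate key, showing the state code needs a JOIN.
with_join = conn.execute("""
    SELECT a.city, s.code
    FROM address_s AS a
    JOIN state_s AS s ON s.id = a.state_id
    ORDER BY a.id
""").fetchall()

# With the natural key, the code is already in the referencing row.
without_join = conn.execute(
    "SELECT city, state_code FROM address_n ORDER BY id"
).fetchall()

print(with_join)
print(without_join)
```

Both queries return the same rows; the only difference is whether the lookup table has to be touched to get a readable value.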
Comments (14)
I was a bit annoyed by the lack of benchmarks for this online, so I ran a test myself.
Note though that I don't do this on a regular basis, so please check my setup and steps for any factors that could have influenced the results unintentionally, and post your concerns in the comments.
The setup was as follows:
The tables:
Then, I filled 10 million rows in each table with a PHP script whose essence is like this:
For int tables, the bit ($keys[rand(0, 9)]) was replaced with just rand(0, 9), and for varchar tables, I used full US state names, without cutting or extending them to 6 characters. generate_random_string() generates a 10-character random string.
Then I ran in MySQL:
SET SESSION query_cache_type=0;
For the jan_int table:
SELECT count(*) FROM jan_int WHERE myindex = 5;
SELECT BENCHMARK(1000000000, (SELECT count(*) FROM jan_int WHERE myindex = 5));
For the other tables, the same queries with myindex = 'califo' for char tables and myindex = 'california' for varchar tables.
Times of the BENCHMARK query on each table:
Regarding table & index sizes, here's the output of show table status from janperformancetest; (w/ a few columns not shown):
My conclusion is that there's no performance difference for this particular use case.
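For anyone who wants to try something similar at small scale, the following is a rough analogue of the experiment, not a reproduction of it. It assumes SQLite (via Python's sqlite3) instead of MySQL, far fewer rows, and hypothetical table names (t_int, t_varchar), so any timings it prints say nothing about MySQL itself.

```python
import random
import sqlite3
import time

STATES = ["alabama", "alaska", "arizona", "arkansas", "california",
          "colorado", "connecticut", "delaware", "florida", "georgia"]


def build(conn, rows=100_000, seed=0):
    """Fill an int-keyed and a varchar-keyed table with the same distribution."""
    rng = random.Random(seed)
    conn.executescript("""
        CREATE TABLE t_int (myindex INTEGER, payload TEXT);
        CREATE TABLE t_varchar (myindex TEXT, payload TEXT);
    """)
    picks = [rng.randrange(10) for _ in range(rows)]
    conn.executemany("INSERT INTO t_int VALUES (?, 'x')",
                     [(p,) for p in picks])
    conn.executemany("INSERT INTO t_varchar VALUES (?, 'x')",
                     [(STATES[p],) for p in picks])
    conn.executescript("""
        CREATE INDEX i_int ON t_int (myindex);
        CREATE INDEX i_varchar ON t_varchar (myindex);
    """)
    conn.commit()


def timed_count(conn, sql, param, repeats=20):
    """Run the same COUNT query several times; return (count, seconds)."""
    start = time.perf_counter()
    for _ in range(repeats):
        (count,) = conn.execute(sql, (param,)).fetchone()
    return count, time.perf_counter() - start


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build(conn)
    # Index 4 in STATES is "california", so both queries match the same rows.
    print(timed_count(conn, "SELECT count(*) FROM t_int WHERE myindex = ?", 4))
    print(timed_count(conn, "SELECT count(*) FROM t_varchar WHERE myindex = ?",
                      "california"))
```

Using one seeded sequence for both tables keeps the row distribution identical, so the two counts should match and only the key type differs.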
You make a good point that you can avoid some number of joined queries by using what's called a natural key instead of a surrogate key. Only you can assess if the benefit of this is significant in your application.
That is, you can measure the queries in your application that are the most important to be speedy, because they work with large volumes of data or they are executed very frequently. If these queries benefit from eliminating a join, and do not suffer by using a varchar primary key, then do it.
Don't use either strategy for all tables in your database. It's likely that in some cases, a natural key is better, but in other cases a surrogate key is better.
Other folks make a good point that it's rare in practice for a natural key to never change or have duplicates, so surrogate keys are usually worthwhile.
It's not about performance. It's about what makes a good primary key: unique and unchanging over time. You may think an entity such as a country code never changes over time and would be a good candidate for a primary key. But bitter experience shows that this is seldom so.
INT AUTO_INCREMENT meets the "unique and unchanging over time" condition. Hence the preference.
Absolutely not.
I have done several... several... performance checks between INT, VARCHAR, and CHAR.
A 10-million-record table with a PRIMARY KEY (unique and clustered) had the exact same speed and performance (and subtree cost) no matter which of the three I used.
That being said... use whatever is best for your application. Don't worry about the performance.
Depends on the length. If the varchar is 20 characters and the int is 4 bytes, then with an int your index will have five times as many nodes per page of index space on disk. That means that traversing the index will require one fifth as many physical and/or logical reads.
So, if performance is an issue, given the opportunity, always use an integral non-meaningful key (called a surrogate) for your tables, and for the foreign keys that reference the rows in these tables.
At the same time, to guarantee data consistency, every table where it matters should also have a meaningful non-numeric alternate key (or unique index) to ensure that duplicate rows cannot be inserted (duplicate based on meaningful table attributes).
For the specific use you are talking about (like state lookups) it really doesn't matter, because the table is so small. In general, indices have no impact on performance on tables with fewer than a few thousand rows.
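The arithmetic behind the page-density point can be sketched as follows. This is a back-of-the-envelope model with loud assumptions: a 16 KB page (InnoDB's default page size), one byte per character, and no per-record or pointer overhead, which real indexes do have. The function names are hypothetical.

```python
def keys_per_page(key_bytes, page_bytes=16384):
    """How many keys fit in one page, ignoring all per-record overhead."""
    return page_bytes // key_bytes


def levels_needed(rows, fanout):
    """Smallest number of B-tree levels whose capacity covers `rows`."""
    levels, capacity = 1, fanout
    while capacity < rows:
        capacity *= fanout
        levels += 1
    return levels


if __name__ == "__main__":
    int_fanout = keys_per_page(4)       # 4-byte INT key
    varchar_fanout = keys_per_page(20)  # 20-byte string key (assumed 1 byte/char)
    for rows in (50, 10_000_000):
        print(rows,
              levels_needed(rows, int_fanout),
              levels_needed(rows, varchar_fanout))
```

Under these assumptions the INT key packs roughly five times as many entries per page. For 10 million rows that works out to 2 levels for the INT key and 3 for the 20-byte key, while at 50 rows both fit in a single level, which matches the remark that small lookup tables are unaffected.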
For short codes, there's probably no difference. This is especially true as the table holding these codes is likely to be very small (a couple thousand rows at most) and not change often (when was the last time we added a new US state?).
For larger tables with wider variation among the keys, this can be dangerous. Think about using the e-mail address or user name from a User table, for example. What happens when you have a few million users and some of those users have long names or e-mail addresses? Now any time you need to join this table using that key, it becomes much more expensive.
As for Primary Key, whatever physically makes a row unique should be determined as the primary key.
For a reference as a foreign key, using an auto incrementing integer as a surrogate is a nice idea for two main reasons.
- First, there's less overhead incurred in the join usually.
- Second, if you need to update the table that contains the unique varchar, then the update has to cascade down to all the child tables and update all of them as well as the indexes, whereas with the int surrogate, it only has to update the master table and its indexes.
The drawback to using the surrogate is that you could possibly allow changing of the meaning of the surrogate:
It all depends on what you really need to worry about in your structure and what means most.
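The cascade point in the second reason above can be illustrated with a small sketch. It assumes SQLite via Python's sqlite3 rather than MySQL, hypothetical table names, and a made-up rename of the code 'UK' to 'GB' purely as an example of a natural key changing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    -- Natural key: child rows store the code, so a change must cascade.
    CREATE TABLE country_n (code TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE address_n (
        id INTEGER PRIMARY KEY,
        country_code TEXT REFERENCES country_n(code) ON UPDATE CASCADE);
    -- Surrogate key: child rows store an id that never needs to change.
    CREATE TABLE country_s (id INTEGER PRIMARY KEY, code TEXT UNIQUE, name TEXT);
    CREATE TABLE address_s (
        id INTEGER PRIMARY KEY,
        country_id INTEGER REFERENCES country_s(id));
    INSERT INTO country_n VALUES ('UK', 'United Kingdom');
    INSERT INTO address_n VALUES (1, 'UK'), (2, 'UK'), (3, 'UK');
    INSERT INTO country_s VALUES (1, 'UK', 'United Kingdom');
    INSERT INTO address_s VALUES (1, 1), (2, 1), (3, 1);
""")

# Hypothetical code change applied to both designs.
conn.execute("UPDATE country_n SET code = 'GB' WHERE code = 'UK'")
conn.execute("UPDATE country_s SET code = 'GB' WHERE code = 'UK'")

# Every child row was rewritten in the natural-key design...
natural_children = [r[0] for r in conn.execute(
    "SELECT country_code FROM address_n ORDER BY id")]
# ...while the surrogate-key children still hold the same untouched id.
surrogate_children = [r[0] for r in conn.execute(
    "SELECT country_id FROM address_s ORDER BY id")]

print(natural_children, surrogate_children)
```

With three child rows the difference is trivial; the concern in the answer is the same effect across large child tables and their indexes.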
Common cases where a surrogate AUTO_INCREMENT hurts:
A common schema pattern is a many-to-many mapping:
Performance of this pattern is much better, especially when using InnoDB:
Why?
id and one index.
Another case (country):
All too often the novice normalizes country_code into a 4-byte INT instead of using a 'natural' 2-byte, nearly-unchanging string. Faster, smaller, fewer JOINs, more readable.
At HauteLook, we changed many of our tables to use natural keys. We did experience a real-world increase in performance. As you mention, many of our queries now use fewer joins, which makes the queries more performant. We will even use a composite primary key if it makes sense. That being said, some tables are just easier to work with if they have a surrogate key.
Also, if you are letting people write interfaces to your database, a surrogate key can be helpful. The 3rd party can rely on the fact that the surrogate key will change only in very rare circumstances.
I faced the same dilemma. I built a DW (constellation schema) with 3 fact tables: Road Accidents, Vehicles in Accidents and Casualties in Accidents. The data includes all accidents recorded in the UK from 1979 to 2012, plus 60 dimension tables. Altogether, about 20 million records.
Fact table relationships:
RDBMS: MySQL 5.6
Natively, the accident index is a varchar (numbers and letters) with 15 characters. I tried not to have surrogate keys, since the accident indexes would never change.
On an i7 (8 cores) computer, the DW became too slow to query after loading 12 million records, depending on the dimensions.
After a lot of rework and adding bigint surrogate keys, I got an average 20% speed boost.
That is still too small a performance gain, but it was a valid try. I'm working on MySQL tuning and clustering.
The question is about MySQL, so I say there is a significant difference. If it were about Oracle (which stores numbers as strings - yes, I couldn't believe it at first), then there would not be much difference.
Storage in the table is not the issue, but updating and referring to the index is. Queries that look up a record based on its primary key are frequent - you want them to be as fast as possible because they happen so often.
The thing is, a CPU deals with 4-byte and 8-byte integers naturally, in silicon. It's REALLY fast for it to compare two integers - it happens in one or two clock cycles.
Now look at a string - it's made up of lots of characters (more than one byte per character these days). Comparing two strings for precedence can't be done in one or two cycles. Instead, the strings' characters must be iterated until a difference is found. I'm sure there are tricks to make it faster in some databases, but that's irrelevant here because an int comparison is done naturally and lightning fast in silicon by the CPU.
My general rule - every primary key should be an autoincrementing INT, especially in OO apps using an ORM (Hibernate, Datanucleus, whatever) where there are lots of relationships between objects - they'll usually be implemented as a simple FK, and the ability of the DB to resolve those fast is important to your app's responsiveness.
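The "iterate until a difference is found" step can be sketched like this. It is only an illustration of the logic in Python: the function name is hypothetical, it compares raw bytes, and it ignores collations and the vectorized comparison routines a real database engine would likely use.

```python
def compare_bytes(a: bytes, b: bytes):
    """Compare two byte strings; return (ordering, bytes_examined).

    ordering is -1, 0 or 1; bytes_examined counts loop iterations.
    """
    steps = 0
    for x, y in zip(a, b):
        steps += 1
        if x != y:
            return (-1 if x < y else 1), steps
    # All shared positions matched: the shorter string sorts first.
    return (len(a) > len(b)) - (len(a) < len(b)), steps


if __name__ == "__main__":
    # Keys that share a long prefix need many steps...
    print(compare_bytes(b"california", b"californiz"))
    # ...while keys that differ in the first byte are settled immediately.
    print(compare_bytes(b"alaska", b"texas"))
```

The number of steps depends on how long the common prefix is, so short codes or keys that diverge early cost far less than long keys with shared prefixes.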
Allow me to say yes, there is definitely a difference, taking into consideration the scope of performance (out-of-the-box definition):
1- Using a surrogate int is faster in the application because you do not need to use ToUpper(), ToLower(), ToUpperInvariant(), or ToLowerInvariant() in your code or in your query, and these 4 functions have different performance benchmarks. See Microsoft's performance rules on this. (performance of application)
2- Using a surrogate int guarantees that the key does not change over time. Even country codes may change; see Wikipedia for how ISO codes have changed over time. Changing the primary key for whole subtrees would take a lot of time. (performance of data maintenance)
3- It seems there are issues with ORM solutions, such as NHibernate, when the PK/FK is not an int. (developer performance)
Not sure about the performance implications, but it seems a possible compromise, at least during development, would be to include both the auto-incremented, integer "surrogate" key, as well as your intended, unique, "natural" key. This would give you the opportunity to evaluate performance, as well as other possible issues, including the changeability of natural keys.
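A minimal sketch of that compromise, assuming SQLite via Python's sqlite3 in place of MySQL and a hypothetical country table: the integer surrogate is the primary key, and the natural code is kept under a UNIQUE constraint so duplicates are still rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE country (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        code TEXT NOT NULL UNIQUE,               -- natural key, still enforced
        name TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO country (code, name) VALUES ('FR', 'France')")

try:
    # A second row with the same natural key must be refused.
    conn.execute(
        "INSERT INTO country (code, name) VALUES ('FR', 'French Republic')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

rows = conn.execute("SELECT id, code FROM country").fetchall()
print(duplicate_rejected, rows)
```

Child tables can then reference either column during development, which makes it possible to measure both options before committing to one.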
As usual, there are no blanket answers. 'It depends!' and I am not being facetious. My understanding of the original question is that it concerns keys on small tables - like Country (integer id or char/varchar code) - that serve as foreign keys in a potentially huge table such as an address/contact table.
There are two scenarios here when you want data back from the DB. The first is a list/search kind of query where you want to list all the contacts with state and country codes or names (ids will not help, so a lookup is needed). The other is a get scenario on the primary key, which shows a single contact record where the names of the state and country need to be shown.
For the latter get, it probably does not matter what the FK is based on, since we are bringing together tables for a single record or a few records, and on key reads. The former (search or list) scenario may be impacted by our choice. Since the country has to be shown (at least a recognizable code, and perhaps the search itself includes a country code), not having to join another table through a surrogate key can potentially improve performance (I am being cautious here because I have not actually tested this, but it seems highly probable); it certainly helps with the search.
As codes are small in size - usually not more than 3 chars for country and state - it may be okay to use the natural keys as foreign keys in this scenario.
In the other scenario, where keys depend on longer varchar values and perhaps on larger tables, the surrogate key probably has the advantage.