Is there a real performance difference between INT and VARCHAR primary keys?

Posted on 2024-07-09 01:19:23

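Is there a measurable performance difference between using INT and VARCHAR as a primary key in MySQL? I'd like to use VARCHAR as the primary key for reference lists (think US states, country codes), and a coworker won't budge on INT AUTO_INCREMENT as the primary key for all tables.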

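My argument, as detailed here, is that the performance difference between INT and VARCHAR is negligible: every INT foreign key reference requires a JOIN to make sense of the reference, whereas a VARCHAR key presents the information directly.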
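The trade-off in this argument can be made concrete with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption, chosen so the example runs without a server), and all table and column names are hypothetical. It only shows the shape of the two queries, not their relative speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Surrogate design: the lookup table has a meaningless integer id.
    CREATE TABLE state_int (id INTEGER PRIMARY KEY, code TEXT UNIQUE, name TEXT);
    CREATE TABLE address_int (city TEXT, state_id INTEGER REFERENCES state_int(id));

    -- Natural design: the two-letter code itself is the key.
    CREATE TABLE state_code (code TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE address_code (city TEXT, state_code TEXT REFERENCES state_code(code));

    INSERT INTO state_int VALUES (1, 'CA', 'California'), (2, 'NV', 'Nevada');
    INSERT INTO state_code VALUES ('CA', 'California'), ('NV', 'Nevada');
    INSERT INTO address_int VALUES ('Fresno', 1), ('Reno', 2);
    INSERT INTO address_code VALUES ('Fresno', 'CA'), ('Reno', 'NV');
""")

# With the surrogate key, a JOIN is needed to show a readable state code.
with_join = conn.execute("""
    SELECT a.city, s.code
    FROM address_int AS a JOIN state_int AS s ON s.id = a.state_id
    ORDER BY a.city
""").fetchall()

# With the natural key, the referencing row already carries the code.
without_join = conn.execute(
    "SELECT city, state_code FROM address_code ORDER BY city"
).fetchall()

print(with_join)
print(without_join)
```

Both queries return the same rows; the difference is only whether the lookup table has to be touched to get them.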

So, does anyone have experience with this particular use case and the performance concerns associated with it?

旧故 2024-07-16 01:19:23

I was a bit annoyed by the lack of benchmarks for this online, so I ran a test myself.

Note, though, that I don't do this on a regular basis, so please check my setup and steps for any factors that could have influenced the results unintentionally, and post your concerns in the comments.

The setup was as follows:

Intel® Core™ i7-7500U CPU @ 2.70GHz × 4
15.6 GiB RAM, of which I ensured around 8 GB was free during the test.
148.6 GB SSD drive, with plenty of free space.
Ubuntu 16.04 64-bit
MySQL Ver 14.14 Distrib 5.7.20, for Linux (x86_64)

The tables:

create table jan_int (data1 varchar(255), data2 int(10), myindex tinyint(4)) ENGINE=InnoDB;
create table jan_int_index (data1 varchar(255), data2 int(10), myindex tinyint(4), INDEX (myindex)) ENGINE=InnoDB;
create table jan_char (data1 varchar(255), data2 int(10), myindex char(6)) ENGINE=InnoDB;
create table jan_char_index (data1 varchar(255), data2 int(10), myindex char(6), INDEX (myindex)) ENGINE=InnoDB;
create table jan_varchar (data1 varchar(255), data2 int(10), myindex varchar(63)) ENGINE=InnoDB;
create table jan_varchar_index (data1 varchar(255), data2 int(10), myindex varchar(63), INDEX (myindex)) ENGINE=InnoDB;

Then, I filled 10 million rows in each table with a PHP script whose essence is like this:

// get_pdo() and generate_random_string() are helpers not shown here.
$pdo = get_pdo();

$keys = [ 'alabam', 'massac', 'newyor', 'newham', 'delawa', 'califo', 'nevada', 'texas_', 'florid', 'ohio__' ];

// 10 passes x 1000 batches x 1000 rows per INSERT = 10 million rows.
for ($k = 0; $k < 10; $k++) {
    for ($j = 0; $j < 1000; $j++) {
        // Build one multi-row VALUES list of 1000 random rows.
        $val = '';
        for ($i = 0; $i < 1000; $i++) {
            $val .= '("' . generate_random_string() . '", ' . rand(0, 10000) . ', "' . ($keys[rand(0, 9)]) . '"),';
        }
        $val = rtrim($val, ',');  // drop the trailing comma
        $pdo->query('INSERT INTO jan_char VALUES ' . $val);
    }
    echo "\n" . ($k + 1) . ' million(s) rows inserted.';
}

For int tables, the bit ($keys[rand(0, 9)]) was replaced with just rand(0, 9), and for varchar tables, I used full US state names, without cutting or extending them to 6 characters. generate_random_string() generates a 10-character random string.

Then I ran in MySQL:

SET SESSION query_cache_type=0;
For jan_int table:
- SELECT count(*) FROM jan_int WHERE myindex = 5;
- SELECT BENCHMARK(1000000000, (SELECT count(*) FROM jan_int WHERE myindex = 5));
For other tables, same as above, with myindex = 'califo' for char tables and myindex = 'california' for varchar tables.

Times of the BENCHMARK query on each table:

jan_int: 21.30 sec
jan_int_index: 18.79 sec
jan_char: 21.70 sec
jan_char_index: 18.85 sec
jan_varchar: 21.76 sec
jan_varchar_index: 18.86 sec

Regarding table & index sizes, here's the output of show table status from janperformancetest; (w/ a few columns not shown):

|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Name              | Engine | Version | Row_format | Rows    | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Collation              |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| jan_int           | InnoDB |      10 | Dynamic    | 9739094 |             43 |   422510592 |               0 |            0 |   4194304 |           NULL | utf8mb4_unicode_520_ci |  
| jan_int_index     | InnoDB |      10 | Dynamic    | 9740329 |             43 |   420413440 |               0 |    132857856 |   7340032 |           NULL | utf8mb4_unicode_520_ci |   
| jan_char          | InnoDB |      10 | Dynamic    | 9726613 |             51 |   500170752 |               0 |            0 |   5242880 |           NULL | utf8mb4_unicode_520_ci |  
| jan_char_index    | InnoDB |      10 | Dynamic    | 9719059 |             52 |   513802240 |               0 |    202342400 |   5242880 |           NULL | utf8mb4_unicode_520_ci |  
| jan_varchar       | InnoDB |      10 | Dynamic    | 9722049 |             53 |   521142272 |               0 |            0 |   7340032 |           NULL | utf8mb4_unicode_520_ci |   
| jan_varchar_index | InnoDB |      10 | Dynamic    | 9738381 |             49 |   486539264 |               0 |    202375168 |   7340032 |           NULL | utf8mb4_unicode_520_ci | 
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|

My conclusion is that there's no performance difference for this particular use case.

浅忆流年 2024-07-16 01:19:23

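You make a good point that you can avoid some number of joined queries by using what's called a natural key instead of a surrogate key. Only you can assess whether the benefit of this is significant in your application.

That is, you can measure the queries in your application that most need to be fast, because they work with large volumes of data or are executed very frequently. If these queries benefit from eliminating a join, and do not suffer from using a varchar primary key, then do it.

Don't use either strategy for all tables in your database. It's likely that in some cases a natural key is better, and in other cases a surrogate key is better.

Other folks make a good point that it's rare in practice for a natural key to never change or have duplicates, so surrogate keys are usually worthwhile.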

吃兔兔 2024-07-16 01:19:23

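It's not about performance. It's about what makes a good primary key: unique and unchanging over time. You may think an entity such as a country code never changes over time and would be a good candidate for a primary key. But bitter experience shows that is seldom so.

INT AUTO_INCREMENT meets the "unique and unchanging over time" condition. Hence the preference.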

眉目亦如画i 2024-07-16 01:19:23

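Absolutely not.

I have done several... several... performance checks between INT, VARCHAR, and CHAR.

A 10 million record table with a primary key (unique and clustered) had exactly the same speed and performance (and subtree cost) no matter which of the three types I used.

That being said... use whatever is best for your application. Don't worry about the performance.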

等数载，海棠开 2024-07-16 01:19:23

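It depends on the length. If the varchar is 20 characters and the int is 4 bytes, then with an int your index will have five times as many nodes per page of index space on disk. That means traversing the index will require one fifth as many physical and/or logical reads.

So, if performance is an issue, given the opportunity, always use an integer non-meaningful key (called a surrogate) for your tables, and for the foreign keys that reference rows in those tables.

At the same time, to guarantee data consistency, every table where it matters should also have a meaningful non-numeric alternate key (or unique index) to ensure that duplicate rows cannot be inserted (duplicate based on meaningful table attributes).

For the specific use you are talking about (like state lookups), it really doesn't matter, because the table is so small. In general, indexes on tables with fewer than a few thousand rows have no impact on performance.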
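The page-density reasoning above can be checked with some back-of-the-envelope arithmetic. This is a rough sketch, not a model of InnoDB's real record format: the 16 KiB page size is InnoDB's default, but `POINTER_BYTES` is an assumed per-entry overhead picked purely for illustration, and page headers and fill factor are ignored.

```python
PAGE_BYTES = 16 * 1024  # InnoDB's default page size
POINTER_BYTES = 6       # assumed per-entry overhead; illustrative only


def entries_per_page(key_bytes: int) -> int:
    """Rough number of index entries that fit in one page."""
    return PAGE_BYTES // (key_bytes + POINTER_BYTES)


def tree_levels(rows: int, key_bytes: int) -> int:
    """Approximate B-tree height: smallest level count whose capacity covers rows."""
    fanout = entries_per_page(key_bytes)
    levels = 1
    capacity = fanout
    while capacity < rows:
        capacity *= fanout
        levels += 1
    return levels


for rows in (50, 10_000_000):
    for label, key_bytes in (("INT, 4 bytes", 4), ("VARCHAR, 20 bytes", 20)):
        print(rows, label, entries_per_page(key_bytes), tree_levels(rows, key_bytes))
```

Under these assumptions the narrower key packs more entries into each page, but the number of levels a single-row lookup descends grows only logarithmically, so both key widths end up with the same height at 10 million rows. The density difference matters more for range scans and for how much of the index fits in memory. A 50-row lookup table fits in a single page either way.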

树深时见影 2024-07-16 01:19:23

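For short codes, there's probably no difference. This is especially true because the table holding these codes is likely to be very small (a couple of thousand rows at most) and to change rarely (when was the last time we added a new US state?).

For larger tables with wider variation among the keys, this can be dangerous. Think about using the e-mail address or user name from a User table, for example. What happens when you have a few million users and some of them have long names or e-mail addresses? Now, any time you need to join this table using that key, it becomes much more expensive.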

你是暖光i 2024-07-16 01:19:23

As for Primary Key, whatever physically makes a row unique should be determined as the primary key.

For a reference as a foreign key, using an auto-incrementing integer as a surrogate is a nice idea for two main reasons.
- First, there's usually less overhead incurred in the join.
- Second, if you need to update the table that contains the unique varchar, then the update has to cascade down to all the child tables and update all of them as well as the indexes, whereas with the int surrogate, it only has to update the master table and its indexes.

The drawback to using the surrogate is that you could possibly allow the meaning of the surrogate to change:

For example:
id value
1 A
2 B
3 C

Update 3 to D
id value
1 A
2 B
3 D

Update 2 to C
id value
1 A
2 C
3 D

Update 3 to B
id value
1 A
2 C
3 B

It all depends on what you really need to worry about in your structure and what means most.
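The cascade point in this answer can be sketched as follows. The example uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made so it runs anywhere), and the table names are hypothetical. The code change ZR to CD mirrors the rename of Zaire to the Democratic Republic of the Congo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    -- Natural key: child rows store the code itself.
    CREATE TABLE country_nat (code TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE city_nat (
        name TEXT,
        country_code TEXT REFERENCES country_nat(code) ON UPDATE CASCADE
    );

    -- Surrogate key: child rows store only the integer id.
    CREATE TABLE country_sur (id INTEGER PRIMARY KEY, code TEXT UNIQUE, name TEXT);
    CREATE TABLE city_sur (name TEXT, country_id INTEGER REFERENCES country_sur(id));

    INSERT INTO country_nat VALUES ('ZR', 'Zaire');
    INSERT INTO city_nat VALUES ('Kinshasa', 'ZR'), ('Goma', 'ZR');
    INSERT INTO country_sur VALUES (1, 'ZR', 'Zaire');
    INSERT INTO city_sur VALUES ('Kinshasa', 1), ('Goma', 1);
""")

# Change the code in both designs.
conn.execute("UPDATE country_nat SET code = 'CD' WHERE code = 'ZR'")
conn.execute("UPDATE country_sur SET code = 'CD' WHERE id = 1")

nat_children = conn.execute(
    "SELECT country_code FROM city_nat ORDER BY name"
).fetchall()
sur_children = conn.execute(
    "SELECT country_id FROM city_sur ORDER BY name"
).fetchall()

print(nat_children)  # every child row was rewritten by the cascade
print(sur_children)  # child rows were not touched
```

With two child rows the cost is invisible; the concern in the answer is what the same cascade does when the child tables hold millions of rows and several indexes.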

枯叶蝶 2024-07-16 01:19:23

Common cases where a surrogate AUTO_INCREMENT hurts:

A common schema pattern is a many-to-many mapping:

CREATE TABLE map (
    id ... AUTO_INCREMENT,
    foo_id ...,
    bar_id ...,
    PRIMARY KEY(id),
    UNIQUE(foo_id, bar_id),
    INDEX(bar_id) );

Performance of this pattern is much better, especially when using InnoDB:

CREATE TABLE map (
    # No surrogate
    foo_id ...,
    bar_id ...,
    PRIMARY KEY(foo_id, bar_id),
    INDEX      (bar_id, foo_id) );

Why?

- InnoDB secondary keys need an extra lookup; by moving the pair into the PK, that is avoided for one direction.
- The secondary index is "covering", so it does not need the extra lookup.
- This table is smaller because of getting rid of id and one index.
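The "covering" claim can be inspected with a query plan. The sketch below uses SQLite through Python's sqlite3 module rather than MySQL/InnoDB, which is a substitution: a `WITHOUT ROWID` table is only loosely analogous to an InnoDB clustered primary key, and the index name `map_bar_foo` and the `plan` helper are hypothetical. On MySQL, the equivalent check would be `EXPLAIN` on the same queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- WITHOUT ROWID stores rows in primary-key order, loosely similar to
    -- an InnoDB clustered primary key.
    CREATE TABLE map (
        foo_id INTEGER NOT NULL,
        bar_id INTEGER NOT NULL,
        PRIMARY KEY (foo_id, bar_id)
    ) WITHOUT ROWID;
    CREATE INDEX map_bar_foo ON map (bar_id, foo_id);
    INSERT INTO map VALUES (1, 10), (1, 20), (2, 10);
""")


def plan(sql: str) -> str:
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# foo -> bar lookups can be answered from the primary key itself.
print(plan("SELECT bar_id FROM map WHERE foo_id = 1"))
# bar -> foo lookups can be answered from the secondary index alone.
print(plan("SELECT foo_id FROM map WHERE bar_id = 10"))

foos_for_bar_10 = conn.execute(
    "SELECT foo_id FROM map WHERE bar_id = 10 ORDER BY foo_id"
).fetchall()
print(foos_for_bar_10)
```

Because both columns live in both structures, neither direction of the mapping needs to visit a second B-tree to fetch the other column.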

Another case (country):

country_id INT ...
-- versus
country_code CHAR(2) CHARACTER SET ascii

All too often the novice normalizes country_code into a 4-byte INT instead of using the 'natural', nearly unchanging 2-byte string. The string is faster, smaller, needs fewer JOINs, and is more readable.

初心未许 2024-07-16 01:19:23

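At HauteLook, we changed many of our tables to use natural keys. We did experience a real-world increase in performance. As you mention, many of our queries now use fewer joins, which makes them more performant. We will even use a composite primary key if it makes sense. That being said, some tables are just easier to work with if they have a surrogate key.

Also, if you are letting people write interfaces to your database, a surrogate key can be helpful. The third party can rely on the fact that the surrogate key will change only in very rare circumstances.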

回忆追雨的时光 2024-07-16 01:19:23

I faced the same dilemma. I built a DW (constellation schema) with 3 fact tables: Road Accidents, Vehicles in Accidents, and Casualties in Accidents. The data includes all accidents recorded in the UK from 1979 to 2012, plus 60 dimension tables. All together, about 20 million records.

Fact table relationships:

+----------+          +---------+
| Accident |>--------<| Vehicle |
+-----v----+ 1      * +----v----+
     1|                    |1
      |    +----------+    |
      +---<| Casualty |>---+
         * +----------+ *

RDBMS: MySQL 5.6

Natively, the Accident index is a varchar (numbers and letters) of 15 characters. I tried not to have surrogate keys, since the accident indexes would never change.
On an i7 (8 cores) computer, the DW became too slow to query after loading 12 million records, depending on the dimensions.
After a lot of rework and adding bigint surrogate keys, I got an average 20% speed boost.
Still a low performance gain, but a valid try. I'm working on MySQL tuning and clustering.

迷迭香的记忆 2024-07-16 01:19:23

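The question is about MySQL, so I say there is a significant difference. If it were about Oracle (which stores numbers as strings; yes, I couldn't believe it at first), then there would not be much difference.

Storage in the table is not the issue; updating and referring to the index is. Queries that look up a record by its primary key are frequent, and you want them to be as fast as possible because they happen so often.

The thing is, a CPU deals with 4-byte and 8-byte integers natively, in silicon. It is really fast at comparing two integers: it happens in one or two clock cycles.

Now look at a string. It is made up of lots of characters (more than one byte per character these days). Comparing two strings for precedence can't be done in one or two cycles. Instead, the strings' characters must be iterated until a difference is found. I'm sure there are tricks to make this faster in some databases, but that's irrelevant here, because an int comparison is done natively and lightning fast in silicon by the CPU.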
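The "iterate until a difference is found" step can be sketched like this. It is a deliberately naive byte-wise comparison written for illustration, with a hypothetical helper name; real collation-aware comparisons in a database are more involved, and the sketch says nothing about whether the extra work is noticeable in practice.

```python
def bytes_examined(a: bytes, b: bytes) -> int:
    """Count how many byte positions a naive comparison inspects
    before it can tell which of two byte strings sorts first."""
    examined = 0
    for x, y in zip(a, b):
        examined += 1
        if x != y:
            return examined
    # One string is a prefix of the other (or they are equal):
    # every shared position was inspected.
    return examined


print(bytes_examined(b"california", b"colorado"))          # differ at the 2nd byte
print(bytes_examined(b"north carolina", b"north dakota"))  # share the prefix "north "
print(bytes_examined(b"texas", b"texas"))                  # equal: all bytes inspected
```

Keys that share long prefixes, or that are equal, are the worst case for this kind of comparison, whereas a fixed-width integer comparison does the same amount of work for every pair of values.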

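My general rule: every primary key should be an auto-incrementing INT, especially in OO apps using an ORM (Hibernate, Datanucleus, etc.) where there are lots of relationships between objects. Those relationships are almost always implemented as a simple FK, and the database's ability to resolve them quickly is important to your app's responsiveness.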

牵你的手，一向走下去 2024-07-16 01:19:23

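Allow me to say yes, there is definitely a difference, taking a broad, outside-the-box definition of performance:

1- Using a surrogate int is faster in the application because you do not need to call ToUpper(), ToLower(), ToUpperInvariant(), or ToLowerInvariant() in your code or in your query, and these 4 functions have different performance benchmarks. See Microsoft's performance rules on this. (Performance of the application)

2- Using a surrogate int guarantees that the key does not change over time. Even country codes may change; see Wikipedia for how ISO codes have changed over time. Changing the primary key for the dependent subtrees would take a lot of time. (Performance of data maintenance)

3- There seem to be issues with ORM solutions, such as NHibernate, when the PK/FK is not an int. (Developer performance)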

原来是傀儡 2024-07-16 01:19:23

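I'm not sure about the performance implications, but it seems a possible compromise, at least during development, would be to include both the auto-incremented integer "surrogate" key and your intended, unique "natural" key. This would give you the opportunity to evaluate performance, as well as other possible issues, including the changeability of natural keys.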
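A minimal sketch of this compromise follows, again using Python's sqlite3 module as a stand-in for MySQL (an assumption), with a hypothetical `country` table. The surrogate `id` is the primary key, while a `UNIQUE` constraint keeps the natural key enforced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE country (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        code TEXT NOT NULL UNIQUE,               -- natural key, still enforced
        name TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO country (code, name) VALUES ('US', 'United States')")

# A second row with the same natural key is rejected by the UNIQUE constraint.
try:
    conn.execute("INSERT INTO country (code, name) VALUES ('US', 'Duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Either key can be used to find the row.
by_id = conn.execute("SELECT name FROM country WHERE id = 1").fetchone()
by_code = conn.execute("SELECT name FROM country WHERE code = 'US'").fetchone()
print(duplicate_rejected, by_id, by_code)
```

Keeping both keys lets child tables reference either one while you measure, at the cost of an extra index on the lookup table.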

凉风有信 2024-07-16 01:19:23

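As usual, there are no blanket answers. "It depends!" and I am not being facetious. My understanding of the original question is that it concerns keys on small tables, such as Country (integer id or char/varchar code), used as a foreign key in a potentially huge table such as an address/contact table.

There are two scenarios when you want data back from the DB. The first is a list/search kind of query, where you want to list all the contacts with state and country codes or names (ids will not help, so a lookup is needed). The other is a get scenario on the primary key, which shows a single contact record where the names of the state and country need to be shown.

For the latter get, it probably does not matter what the FK is based on, since we are bringing together tables for a single record or a few records, and on key reads. The former (search or list) scenario may be affected by our choice. Since the country has to be shown (at least a recognizable code, and perhaps the search itself even includes a country code), not having to join another table through a surrogate key can potentially improve performance (I am being cautious here because I have not actually tested this, but it seems highly probable), quite apart from the fact that it certainly helps with the search.

As the codes are small (usually not more than 3 characters for country and state), it may be okay to use the natural keys as foreign keys in this scenario.

In the other scenario, where keys depend on longer varchar values and perhaps on larger tables, the surrogate key probably has the advantage.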
