GUID / UUID 数据库密钥的优缺点

发布于 2024-07-03 23:35:43 字数 262 浏览 6 评论 0 原文

我过去曾在许多数据库系统上工作过,如果所有数据库键都是GUID / UUID 值。 我曾多次考虑走这条路,但总是存在一些不确定性,特别是在性能和​​无法通过电话读出的 URL 方面。

有人在数据库中广泛使用过 GUID 吗? 这样做有什么好处,可能存在哪些陷阱?

I've worked on a number of database systems in the past where moving entries between databases would have been made a lot easier if all the database keys had been GUID / UUID values. I've considered going down this path a few times, but there's always a bit of uncertainty, especially around performance and un-read-out-over-the-phone-able URLs.

Has anyone worked extensively with GUIDs in a database? What advantages would I get by going that way, and what are the likely pitfalls?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(10

负佳期 2024-07-10 23:35:43

主要优点是您可以创建唯一的 ID,而无需连接到数据库。 而且 ID 是全球唯一的,因此您可以轻松组合来自不同数据库的数据。 这些看似很小的优势,但在过去为我节省了很多工作。

主要缺点是需要更多的存储空间(在现代系统上不是问题),并且 id 并不是真正人类可读的。 调试时这可能会出现问题。

存在一些性能问题,例如索引碎片。 但这些很容易解决(吉米·尼尔森的梳子指南:http://www.informit .com/articles/article.aspx?p=25862

编辑合并了我对此问题的两个答案

@Matt Sheppard 我认为他的意思是您可以使用不同的 GUID 作为主键来复制行。 这是任何类型的代理键的问题,而不仅仅是 GUID。 正如他所说,通过向非键列添加有意义的唯一约束可以轻松解决这个问题。 另一种方法是使用自然密钥,但它们确实存在问题。

The main advantages are that you can create unique id's without connecting to the database. And id's are globally unique so you can easilly combine data from different databases. These seem like small advantages but have saved me a lot of work in the past.

The main disadvantages are a bit more storage needed (not a problem on modern systems) and the id's are not really human readable. This can be a problem when debugging.

There are some performance problems like index fragmentation. But those are easilly solvable (comb guids by jimmy nillson: http://www.informit.com/articles/article.aspx?p=25862 )

Edit merged my two answers to this question

@Matt Sheppard I think he means that you can duplicate rows with different GUIDs as primary keys. This is an issue with any kind of surrogate key, not just GUIDs. And like he said it is easilly solved by adding meaningfull unique constraints to non-key columns. The alternative is to use a natural key and those have real problems..

你列表最软的妹 2024-07-10 23:35:43

有一点没有真正解决,即使用随机 (UUIDv4) ID 作为主键会损害主键索引的性能。 无论您的表是否围绕键聚集,这种情况都会发生。

RDBM通常保证主键的唯一性,并保证通过键进行查找,其结构称为BTree,它是一种具有大分支因子的搜索树(二叉搜索树的分支因子为2)。 现在,连续整数 ID 会导致插入仅发生在树的一侧,而大多数叶节点不受影响。 添加随机 UUID 将导致插入在整个索引上分割叶节点。

同样,如果存储的数据大部分是临时的,则通常需要访问最新的数据并将其连接到最多的数据。 对于随机 UUID,模式不会从中受益,并且会命中更多索引行,从而需要内存中更多索引页。 对于顺序 ID,如果最需要最新数据,则热索引页将需要更少的 RAM。

There is one thing that is not really addressed, namely using random (UUIDv4) IDs as primary keys will harm the performance of the primary key index. It will happen whether or not your table is clustered around the key.

RDBMs usually ensure the uniqueness of the primary keys, and ensure the lookups by a key, in a structure called BTree, which is a search tree with a large branching factor (a binary search tree has branching factor of 2). Now, a sequential integer ID would cause the inserts to occur just one side of the tree, leaving most of the leaf nodes untouched. Adding random UUIDs will cause the insertions to split leaf nodes all over the index.

Likewise if the data stored is mostly temporal, it is often the case that the most recent data needs to be accessed and joined against the most. With random UUIDs the patterns will not benefit from this, and will hit more index rows, thereby needing more of the index pages in memory. With sequential IDs if the most-recent data is needed the most, the hot index pages would require less RAM.

征﹌骨岁月お 2024-07-10 23:35:43

primary-keys-ids-versus-guid< /a>

GUID 作为主键的成本 (SQL Server 2000)

神话、GUID 与自动增量 (MySQL 5)

这确实是您想要的想。

UUID 优点

  • 在每个表、每个数据库、每个服务器中都是唯一的
  • 允许轻松合并来自不同数据库的记录
  • 跨多个服务器轻松分发数据库
  • 您可以在任何地方生成 ID,而无需往返数据库
  • 允许 复制场景无论如何都需要 GUID 列

GUID 缺点

  • 它比传统的 4 字节索引值大 4 倍; 如果您不小心,这可能会产生严重的性能和存储影响
  • 调试起来很麻烦(其中 userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
  • 生成的 GUID 应该部分连续以获得最佳性能(例如,newsequentialid() on SQL 2005)并启用聚集索引

primary-keys-ids-versus-guids

The Cost of GUIDs as Primary Keys (SQL Server 2000)

Myths, GUID vs. Autoincrement (MySQL 5)

This is realy what you want.

UUID Pros

  • Unique across every table, every database, every server
  • Allows easy merging of records from different databases
  • Allows easy distribution of databases across multiple servers
  • You can generate IDs anywhere, instead of having to roundtrip to the database
  • Most replication scenarios require GUID columns anyway

GUID Cons

  • It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
  • Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
  • The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes
乱世争霸 2024-07-10 23:35:43

如果 GUID 被用作“唯一标识符”,让重复的数据进入您的表中,那么它们可能会在将来给您带来很多麻烦。 如果您想使用 GUID,请考虑仍然在其他列上维护 UNIQUE 约束。

GUIDs may cause you a lot of trouble in the future if they are used as "uniqifiers", letting duplicated data get into your tables. If you want to use GUIDs, please consider still maintaining UNIQUE-constraints on other column(s).

玩心态 2024-07-10 23:35:43

如果您还将该列用作聚集索引(一种相对常见的做法),则使用 GUIDS 作为主键时需要考虑的另一个小问题。 由于 guid 的性质无论如何都不会按顺序开始,因此在插入时它们将是页面拆分等,因此您将在插入时受到影响。 如果系统将具有高 IO,则需要考虑一些事情......

One other small issue to consider with using GUIDS as primary keys if you are also using that column as a clustered index (a relatively common practice). You are going to take a hit on insert because of the nature of a guid not begin sequential in anyway, thus their will be page splits, etc when you insert. Just something to consider if the system is going to have high IO...

笔芯 2024-07-10 23:35:43

优点:

  • 可以离线生成。
  • 使复制变得微不足道(与 int 不同,这使得复制变得非常困难)
  • ORM 通常喜欢它们
  • 在应用程序中是唯一的。 因此,我们可以在应用程序(也是 GUID)中使用 CMS(GUID)中的 PK,并且知道我们永远不会发生冲突。

缺点:

  • 占用空间较大,但空间便宜(呃)
  • 不能通过ID排序来获取插入订单。
  • 在 URL 中看起来可能很难看,但实际上,你到底在搞什么把真正的数据库密钥放在 URL 中!? (这一点在下面的评论中存在争议)
  • 手动调试更难,但也不是那么难。

就我个人而言,我在任何规模合适的系统中将它们用于大多数 PK,但我在一个到处复制的系统上接受了“训练”,所以我们必须拥有它们。 YMMV。

我认为重复数据的事情是垃圾——无论你怎么做,你都可以获得重复的数据。 在我工作的任何地方,代理键通常都会受到人们的反对。 不过,我们确实使用了类似 WordPress 的系统:

  • 行的唯一 ID(GUID/其他)。 用户永远不可见。
  • 公共 ID 从某个字段生成一次(例如标题 - 使其成为文章的标题)

更新:
所以这个得到了很多+1,我想我应该指出 GUID PK 的一个很大的缺点:聚集索引。

如果你有很多记录,并且 GUID 上有聚集索引,那么你的插入性能将会很糟糕,因为你会在项目列表中的随机位置插入(这就是重点),而不是在末尾(这很快)。

因此,如果您需要插入性能,也许可以使用 auto-inc INT,如果您想与其他人共享它(例如,在 URL 中向用户显示它),则可以生成 GUID。

Advantages:

  • Can generate them offline.
  • Makes replication trivial (as opposed to int's, which makes it REALLY hard)
  • ORM's usually like them
  • Unique across applications. So We can use the PK's from our CMS (guid) in our app (also guid) and know we are NEVER going to get a clash.

Disadvantages:

  • Larger space use, but space is cheap(er)
  • Can't order by ID to get the insert order.
  • Can look ugly in a URL, but really, WTF are you doing putting a REAL DB key in a URL!? (This point disputed in comments below)
  • Harder to do manual debugging, but not that hard.

Personally, I use them for most PK's in any system of a decent size, but I got "trained" on a system which was replicated all over the place, so we HAD to have them. YMMV.

I think the duplicate data thing is rubbish - you can get duplicate data however you do it. Surrogate keys are usually frowned upon where ever I've been working. We DO use the WordPress-like system though:

  • unique ID for the row (GUID/whatever). Never visible to the user.
  • public ID is generated ONCE from some field (e.g. the title - make it the-title-of-the-article)

UPDATE:
So this one gets +1'ed a lot, and I thought I should point out a big downside of GUID PK's: Clustered Indexes.

If you have a lot of records, and a clustered index on a GUID, your insert performance will SUCK, as you get inserts in random places in the list of items (that's the point), not at the end (which is quick).

So if you need insert performance, maybe use a auto-inc INT, and generate a GUID if you want to share it with someone else (e.g., showing it to a user in a URL).

暖伴 2024-07-10 23:35:43

目前还没有提到的一件事是:UUID 使得分析数据变得更加困难,

至少对于 Web 应用程序来说,使用 url 中的 id 访问资源是很常见的,例如 stackoverflow.com/questions/45399。 如果 id 是整数,则这两者都

  • 提供有关问题数量的信息(即 2008 年 9 月 5 日,提出了第 45,399 个问题),
  • 并提供了迭代问题的杠杆点(当我将其增加 1 时会发生什么?我打开下一个提出的问题)

从第一点开始,我可以结合问题的时间戳和数字来分析问题的提出频率以及随着时间的推移如何变化。 在像 Stack Overflow 这样具有公开信息的网站上,这一点不太重要,但是,根据上下文,这可能会暴露敏感信息。

例如,我是一家为客户提供权限门禁门户的公司。 地址是 portal.com/profile/{customerId}。 如果 id 是整数,您可以通过定期查询 lastKnownCustomerCount + 1 并检查结果是否为 404 - NotFound< 来分析客户数量,无论是否能够查看其信息。 /code> (客户不存在)或 403 - Forbidden (客户确实存在,但您无权查看)。

UUID 的非顺序性质缓解了这些问题。 这并不是为了防止分析,但它是一个开始。

One thing not mentioned so far: UUIDs make it much harder to profile data

For web apps at least, it's common to access a resource with the id in the url, like stackoverflow.com/questions/45399. If the id is an integer, this both

  • provides information about the number of questions (ie September 5th, 2008, the 45,399th question was asked)
  • provides a leverage point to iterate through questions (what happens when I increment that by 1? I open the next asked question)

From the first point, I can combine the timestamp from the question and the number to profile how frequently questions are asked and how that changes over time. this matters less on a site like Stack Overflow, with publicly available information, but, depending on context, this may expose sensitive information.

For example, I am a company that offers customers a permissions gated portal. the address is portal.com/profile/{customerId}. If the id is an integer, you could profile the number of customers regardless of being able to see their information by querying for lastKnownCustomerCount + 1 regularly, and checking if the result is 404 - NotFound (customer does not exist) or 403 - Forbidden (customer does exist, but you do not have access to view).

UUIDs non-sequential nature mitigate these issues. This isn't a garunted to prevent profiling, but it's a start.

微凉徒眸意 2024-07-10 23:35:43

优点:

  • UUID值在表和数据库之间是唯一的。 这就是为什么它可以合并两个数据库或分布式数据库之间的行。
  • UUID 通过 url 传递比整型数据更安全。
    如果通过url传递UUID,攻击者无法猜测下一个id。但是如果我们传递Integer类型,例如10,那么攻击者可以猜测下一个id是11然后12等。UUID
  • 可以离线生成。

Advantages:

  • UUID values are unique between tables and databases. Thats why it can be merge rows between two databases or distributed databases.
  • UUID is more safer to pass through url than integer type data.
    If one pass UUID through url, attackers can't guess the next id.But if we pass Integer type such as 10, then attackers can guess the next id is 11 then 12 etc.
  • UUID can generate offline.
巡山小妖精 2024-07-10 23:35:43

@Matt Sheppard:

假设你有一桌顾客。 当然,您不希望某个客户在表中出现多次,否则整个销售和物流部门会发生很多混乱(特别是当有关客户的多行包含不同的信息时)。

因此,您有一个唯一标识客户的客户标识符,并且确保客户知道该标识符(在发票中),以便客户和客户服务人员在需要沟通时有一个共同的参考。 为了保证没有重复的客户记录,您可以通过客户标识符上的主键或通过客户标识符列上的 NOT NULL + UNIQUE 约束向表添加唯一性约束。

接下来,由于某种原因(我无法想到),系统会要求您向客户表添加一个 GUID 列并将其作为主键。 如果客户标识符列现在没有唯一性保证,那么您将在整个组织中自找麻烦,因为 GUID 始终是唯一的。

一些“架构师”可能会告诉您“哦,但是我们在应用程序层中处理真实客户唯一性约束!”。 正确的。 关于通用编程语言和(尤其是)中间层框架的时尚一直在变化,并且通常永远不会比数据库更长寿。 并且很有可能您在某些时候需要访问数据库而不需要通过当前的应用程序。 ==麻烦。 (但幸运的是,你和“架构师”早已离开,所以你不会在那里收拾残局。)换句话说:一定要在数据库中维护明显的约束(以及其他层,如果你有的话)时间)。

换句话说:可能有充分的理由将 GUID 列添加到表中,但请不要陷入诱惑,从而降低您对真实(==非 GUID)内一致性的抱负。信息。

@Matt Sheppard:

Say you have a table of customers. Surely you don't want a customer to exist in the table more than once, or lots of confusion will happen throughout your sales and logistics departments (especially if the multiple rows about the customer contain different information).

So you have a customer identifier which uniquely identifies the customer and you make sure that the identifier is known by the customer (in invoices), so that the customer and the customer service people have a common reference in case they need to communicate. To guarantee no duplicated customer records, you add a uniqueness-constraint to the table, either through a primary key on the customer identifier or via a NOT NULL + UNIQUE constraint on the customer identifier column.

Next, for some reason (which I can't think of), you are asked to add a GUID column to the customer table and make that the primary key. If the customer identifier column is now left without a uniqueness-guarantee, you are asking for future trouble throughout the organization because the GUIDs will always be unique.

Some "architect" might tell you that "oh, but we handle the real customer uniqueness constraint in our app tier!". Right. Fashion regarding that general purpose programming languages and (especially) middle tier frameworks changes all the time, and will generally never out-live your database. And there is a very good chance that you will at some point need to access the database without going through the present application. == Trouble. (But fortunately, you and the "architect" are long gone, so you will not be there to clean up the mess.) In other words: Do maintain obvious constraints in the database (and in other tiers, as well, if you have the time).

In other words: There may be good reasons to add GUID columns to tables, but please don't fall for the temptation to make that lower your ambitions for consistency within the real (==non-GUID) information.

狼亦尘 2024-07-10 23:35:43

为什么没有人提到性能? 当你有多个连接时,所有这些都基于这些令人讨厌的 GUID,性能将会下降,一直在那里:(

Why doesn't anyone mention performance? When you have multiple joins, all based on these nasty GUIDs the performance will go through the floor, been there :(

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文