我过去曾在许多数据库系统上工作过,如果所有数据库键都是GUID / UUID 值。 我曾多次考虑走这条路,但总是存在一些不确定性,特别是在性能和无法通过电话读出的 URL 方面。
有人在数据库中广泛使用过 GUID 吗? 这样做有什么好处,可能存在哪些陷阱?
I've worked on a number of database systems in the past where moving entries between databases would have been made a lot easier if all the database keys had been GUID / UUID values. I've considered going down this path a few times, but there's always a bit of uncertainty, especially around performance and un-read-out-over-the-phone-able URLs.
Has anyone worked extensively with GUIDs in a database? What advantages would I get by going that way, and what are the likely pitfalls?
发布评论
评论(10)
主要优点是您可以创建唯一的 ID,而无需连接到数据库。 而且 ID 是全球唯一的,因此您可以轻松组合来自不同数据库的数据。 这些看似很小的优势,但在过去为我节省了很多工作。
主要缺点是需要更多的存储空间(在现代系统上不是问题),并且 id 并不是真正人类可读的。 调试时这可能会出现问题。
存在一些性能问题,例如索引碎片。 但这些很容易解决(吉米·尼尔森的梳子指南:http://www.informit .com/articles/article.aspx?p=25862 )
编辑合并了我对此问题的两个答案
@Matt Sheppard 我认为他的意思是您可以使用不同的 GUID 作为主键来复制行。 这是任何类型的代理键的问题,而不仅仅是 GUID。 正如他所说,通过向非键列添加有意义的唯一约束可以轻松解决这个问题。 另一种方法是使用自然密钥,但它们确实存在问题。
The main advantages are that you can create unique id's without connecting to the database. And id's are globally unique so you can easilly combine data from different databases. These seem like small advantages but have saved me a lot of work in the past.
The main disadvantages are a bit more storage needed (not a problem on modern systems) and the id's are not really human readable. This can be a problem when debugging.
There are some performance problems like index fragmentation. But those are easilly solvable (comb guids by jimmy nillson: http://www.informit.com/articles/article.aspx?p=25862 )
Edit merged my two answers to this question
@Matt Sheppard I think he means that you can duplicate rows with different GUIDs as primary keys. This is an issue with any kind of surrogate key, not just GUIDs. And like he said it is easilly solved by adding meaningfull unique constraints to non-key columns. The alternative is to use a natural key and those have real problems..
有一点没有真正解决,即使用随机 (UUIDv4) ID 作为主键会损害主键索引的性能。 无论您的表是否围绕键聚集,这种情况都会发生。
RDBM通常保证主键的唯一性,并保证通过键进行查找,其结构称为BTree,它是一种具有大分支因子的搜索树(二叉搜索树的分支因子为2)。 现在,连续整数 ID 会导致插入仅发生在树的一侧,而大多数叶节点不受影响。 添加随机 UUID 将导致插入在整个索引上分割叶节点。
同样,如果存储的数据大部分是临时的,则通常需要访问最新的数据并将其连接到最多的数据。 对于随机 UUID,模式不会从中受益,并且会命中更多索引行,从而需要内存中更多索引页。 对于顺序 ID,如果最需要最新数据,则热索引页将需要更少的 RAM。
There is one thing that is not really addressed, namely using random (UUIDv4) IDs as primary keys will harm the performance of the primary key index. It will happen whether or not your table is clustered around the key.
RDBMs usually ensure the uniqueness of the primary keys, and ensure the lookups by a key, in a structure called BTree, which is a search tree with a large branching factor (a binary search tree has branching factor of 2). Now, a sequential integer ID would cause the inserts to occur just one side of the tree, leaving most of the leaf nodes untouched. Adding random UUIDs will cause the insertions to split leaf nodes all over the index.
Likewise if the data stored is mostly temporal, it is often the case that the most recent data needs to be accessed and joined against the most. With random UUIDs the patterns will not benefit from this, and will hit more index rows, thereby needing more of the index pages in memory. With sequential IDs if the most-recent data is needed the most, the hot index pages would require less RAM.
primary-keys-ids-versus-guid< /a>
GUID 作为主键的成本 (SQL Server 2000)
神话、GUID 与自动增量 (MySQL 5)
这确实是您想要的想。
UUID 优点
GUID 缺点
primary-keys-ids-versus-guids
The Cost of GUIDs as Primary Keys (SQL Server 2000)
Myths, GUID vs. Autoincrement (MySQL 5)
This is realy what you want.
UUID Pros
GUID Cons
如果 GUID 被用作“唯一标识符”,让重复的数据进入您的表中,那么它们可能会在将来给您带来很多麻烦。 如果您想使用 GUID,请考虑仍然在其他列上维护 UNIQUE 约束。
GUIDs may cause you a lot of trouble in the future if they are used as "uniqifiers", letting duplicated data get into your tables. If you want to use GUIDs, please consider still maintaining UNIQUE-constraints on other column(s).
如果您还将该列用作聚集索引(一种相对常见的做法),则使用 GUIDS 作为主键时需要考虑的另一个小问题。 由于 guid 的性质无论如何都不会按顺序开始,因此在插入时它们将是页面拆分等,因此您将在插入时受到影响。 如果系统将具有高 IO,则需要考虑一些事情......
One other small issue to consider with using GUIDS as primary keys if you are also using that column as a clustered index (a relatively common practice). You are going to take a hit on insert because of the nature of a guid not begin sequential in anyway, thus their will be page splits, etc when you insert. Just something to consider if the system is going to have high IO...
优点:
缺点:
就我个人而言,我在任何规模合适的系统中将它们用于大多数 PK,但我在一个到处复制的系统上接受了“训练”,所以我们必须拥有它们。 YMMV。
我认为重复数据的事情是垃圾——无论你怎么做,你都可以获得重复的数据。 在我工作的任何地方,代理键通常都会受到人们的反对。 不过,我们确实使用了类似 WordPress 的系统:
更新:
所以这个得到了很多+1,我想我应该指出 GUID PK 的一个很大的缺点:聚集索引。
如果你有很多记录,并且 GUID 上有聚集索引,那么你的插入性能将会很糟糕,因为你会在项目列表中的随机位置插入(这就是重点),而不是在末尾(这很快)。
因此,如果您需要插入性能,也许可以使用 auto-inc INT,如果您想与其他人共享它(例如,在 URL 中向用户显示它),则可以生成 GUID。
Advantages:
Disadvantages:
Personally, I use them for most PK's in any system of a decent size, but I got "trained" on a system which was replicated all over the place, so we HAD to have them. YMMV.
I think the duplicate data thing is rubbish - you can get duplicate data however you do it. Surrogate keys are usually frowned upon where ever I've been working. We DO use the WordPress-like system though:
UPDATE:
So this one gets +1'ed a lot, and I thought I should point out a big downside of GUID PK's: Clustered Indexes.
If you have a lot of records, and a clustered index on a GUID, your insert performance will SUCK, as you get inserts in random places in the list of items (that's the point), not at the end (which is quick).
So if you need insert performance, maybe use a auto-inc INT, and generate a GUID if you want to share it with someone else (e.g., showing it to a user in a URL).
目前还没有提到的一件事是:UUID 使得分析数据变得更加困难,
至少对于 Web 应用程序来说,使用 url 中的 id 访问资源是很常见的,例如
stackoverflow.com/questions/45399
。 如果 id 是整数,则这两者都从第一点开始,我可以结合问题的时间戳和数字来分析问题的提出频率以及随着时间的推移如何变化。 在像 Stack Overflow 这样具有公开信息的网站上,这一点不太重要,但是,根据上下文,这可能会暴露敏感信息。
例如,我是一家为客户提供权限门禁门户的公司。 地址是
portal.com/profile/{customerId}
。 如果 id 是整数,您可以通过定期查询lastKnownCustomerCount + 1
并检查结果是否为404 - NotFound< 来分析客户数量,无论是否能够查看其信息。 /code> (客户不存在)或
403 - Forbidden
(客户确实存在,但您无权查看)。UUID 的非顺序性质缓解了这些问题。 这并不是为了防止分析,但它是一个开始。
One thing not mentioned so far: UUIDs make it much harder to profile data
For web apps at least, it's common to access a resource with the id in the url, like
stackoverflow.com/questions/45399
. If the id is an integer, this bothFrom the first point, I can combine the timestamp from the question and the number to profile how frequently questions are asked and how that changes over time. this matters less on a site like Stack Overflow, with publicly available information, but, depending on context, this may expose sensitive information.
For example, I am a company that offers customers a permissions gated portal. the address is
portal.com/profile/{customerId}
. If the id is an integer, you could profile the number of customers regardless of being able to see their information by querying forlastKnownCustomerCount + 1
regularly, and checking if the result is404 - NotFound
(customer does not exist) or403 - Forbidden
(customer does exist, but you do not have access to view).UUIDs non-sequential nature mitigate these issues. This isn't a garunted to prevent profiling, but it's a start.
优点:
如果通过url传递UUID,攻击者无法猜测下一个id。但是如果我们传递Integer类型,例如10,那么攻击者可以猜测下一个id是11然后12等。UUID
Advantages:
If one pass UUID through url, attackers can't guess the next id.But if we pass Integer type such as 10, then attackers can guess the next id is 11 then 12 etc.
@Matt Sheppard:
假设你有一桌顾客。 当然,您不希望某个客户在表中出现多次,否则整个销售和物流部门会发生很多混乱(特别是当有关客户的多行包含不同的信息时)。
因此,您有一个唯一标识客户的客户标识符,并且确保客户知道该标识符(在发票中),以便客户和客户服务人员在需要沟通时有一个共同的参考。 为了保证没有重复的客户记录,您可以通过客户标识符上的主键或通过客户标识符列上的 NOT NULL + UNIQUE 约束向表添加唯一性约束。
接下来,由于某种原因(我无法想到),系统会要求您向客户表添加一个 GUID 列并将其作为主键。 如果客户标识符列现在没有唯一性保证,那么您将在整个组织中自找麻烦,因为 GUID 始终是唯一的。
一些“架构师”可能会告诉您“哦,但是我们在应用程序层中处理真实客户唯一性约束!”。 正确的。 关于通用编程语言和(尤其是)中间层框架的时尚一直在变化,并且通常永远不会比数据库更长寿。 并且很有可能您在某些时候需要访问数据库而不需要通过当前的应用程序。 ==麻烦。 (但幸运的是,你和“架构师”早已离开,所以你不会在那里收拾残局。)换句话说:一定要在数据库中维护明显的约束(以及其他层,如果你有的话)时间)。
换句话说:可能有充分的理由将 GUID 列添加到表中,但请不要陷入诱惑,从而降低您对真实(==非 GUID)内一致性的抱负。信息。
@Matt Sheppard:
Say you have a table of customers. Surely you don't want a customer to exist in the table more than once, or lots of confusion will happen throughout your sales and logistics departments (especially if the multiple rows about the customer contain different information).
So you have a customer identifier which uniquely identifies the customer and you make sure that the identifier is known by the customer (in invoices), so that the customer and the customer service people have a common reference in case they need to communicate. To guarantee no duplicated customer records, you add a uniqueness-constraint to the table, either through a primary key on the customer identifier or via a NOT NULL + UNIQUE constraint on the customer identifier column.
Next, for some reason (which I can't think of), you are asked to add a GUID column to the customer table and make that the primary key. If the customer identifier column is now left without a uniqueness-guarantee, you are asking for future trouble throughout the organization because the GUIDs will always be unique.
Some "architect" might tell you that "oh, but we handle the real customer uniqueness constraint in our app tier!". Right. Fashion regarding that general purpose programming languages and (especially) middle tier frameworks changes all the time, and will generally never out-live your database. And there is a very good chance that you will at some point need to access the database without going through the present application. == Trouble. (But fortunately, you and the "architect" are long gone, so you will not be there to clean up the mess.) In other words: Do maintain obvious constraints in the database (and in other tiers, as well, if you have the time).
In other words: There may be good reasons to add GUID columns to tables, but please don't fall for the temptation to make that lower your ambitions for consistency within the real (==non-GUID) information.
为什么没有人提到性能? 当你有多个连接时,所有这些都基于这些令人讨厌的 GUID,性能将会下降,一直在那里:(
Why doesn't anyone mention performance? When you have multiple joins, all based on these nasty GUIDs the performance will go through the floor, been there :(