Choosing a primary key type

Posted 2024-08-02 20:11:10

I have a table that will potentially see a high number of inserts per second, and I'm trying to choose the type of primary key I want to use. For illustrative purposes, let's say it's a users table. I am trying to choose between GUID and BIGINT as the primary key and ultimately as the UserID across the app. If I use a GUID, I save a trip to the database to generate a new ID, but a GUID is not "user-friendly" and it's not possible to partition the table by this ID (which I'm planning to do). Using a BIGINT is much more convenient, but generating it is a problem - I can't use IDENTITY (there is a reason for that), so my only choice is to have a helper table that contains the last used ID, and then I call this stored procedure:

create proc GetNewID @ID BIGINT OUTPUT
as
begin
    -- read the current counter value and advance it in one atomic update
    update HelperIDTable set @ID = id, id = id + 1
end

to get the new ID. But this helper table is an obvious bottleneck, and I'm concerned about how many updates per second it can handle.

I really like the idea of using BIGINT as the PK, but the bottleneck problem concerns me - is there a way to roughly estimate how many IDs it could produce per second? I realize it depends heavily on hardware, but are there any physical limitations, and what order of magnitude are we looking at? Hundreds per second? Thousands per second?

Any ideas on how to approach the problem are highly appreciated! This problem hasn't let me sleep for many nights now!

Thanks!
Andrey

5 Answers

小…楫夜泊 2024-08-09 20:11:10

GUIDs seem to be a natural choice - and if you really must, you could probably argue for using one as the PRIMARY KEY of the table - the single value that uniquely identifies the row in the database.

What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.

As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.

Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.

Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT, with room for 2+ billion rows, should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.

So to sum it up: unless you have a really good reason, I would always recommend an INT IDENTITY field as the primary / clustered key on your table.
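
A minimal sketch of the layout described above, assuming a hypothetical Users table (table, column, and constraint names are made up): the GUID stays as a nonclustered PRIMARY KEY, while a narrow, ever-increasing INT carries the clustered index.

-- Illustrative only; names are assumptions, not from the question.
CREATE TABLE dbo.Users
(
    UserGuid UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT PK_Users PRIMARY KEY NONCLUSTERED   -- business identifier
        CONSTRAINT DF_Users_Guid DEFAULT NEWID(),
    UserKey  INT IDENTITY(1,1) NOT NULL,               -- narrow, sequential
    UserName NVARCHAR(100) NOT NULL
);

-- The clustered index goes on the small, ever-increasing column instead of the GUID.
CREATE UNIQUE CLUSTERED INDEX CIX_Users_UserKey ON dbo.Users (UserKey);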

Marc

烟燃烟灭 2024-08-09 20:11:10

I try to use GUID PKs for all tables except small lookup tables. The GUID approach ensures that the identity of an object can safely be created in memory without a roundtrip to the database, and the object can be saved later without its identity changing.

When you need a "human readable" ID, you can add an auto-increment int that is assigned when the row is saved. For partitioning, you could also create the BIGINTs later, via a scheduled database job, for many users in one shot.
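
A rough sketch of that idea, with made-up table and column names: the application generates the GUID identity up front, and a scheduled job later assigns human-readable numbers to many rows in one shot.

-- Illustrative names; the numbering job is just one possible implementation.
CREATE TABLE dbo.AppUsers
(
    UserGuid   UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,   -- created by the application
    UserNumber BIGINT NULL,                             -- assigned later, in bulk
    UserName   NVARCHAR(100) NOT NULL
);

-- Scheduled job: number every row that doesn't have a number yet.
DECLARE @base BIGINT = (SELECT ISNULL(MAX(UserNumber), 0) FROM dbo.AppUsers);

;WITH unnumbered AS
(
    SELECT UserNumber, ROW_NUMBER() OVER (ORDER BY UserGuid) AS rn
    FROM dbo.AppUsers
    WHERE UserNumber IS NULL
)
UPDATE unnumbered SET UserNumber = @base + rn;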

吹泡泡o 2024-08-09 20:11:10

Do you want a primary key, for business reasons, or a clustered key, for storage concerns?
See stackoverflow.com/questions/1151625/int-vs-unique-identifier-for-id-field-in-database for a more elaborate post on the topic of PK vs. clustered key.

You really have to elaborate on why you can't use IDENTITY. Generating the IDs manually, and especially on the server with an extra roundtrip and an update just to generate each ID for the insert, won't scale. You'd be lucky to reach the low 100s per second. The problem is not just the roundtrip and update time, but primarily the interaction of the ID-generation update with insert batching: the insert batching transaction will serialize ID generation. The workaround is to move the ID generation onto a separate session so it can autocommit, but then the insert batching is pointless because the ID generation is not batched: it has to wait for a log flush after each ID is generated in order to commit. Compared to this, UUIDs will run circles around your manual ID generation. But UUIDs are a horrible choice for a clustered key because of fragmentation.

卷耳 2024-08-09 20:11:10

Try to hit your DB with a script, perhaps using JMeter to simulate concurrent hits. Then you can measure for yourself how much load you can handle. Your DB itself could also be the bottleneck. Which one is it? I would prefer PostgreSQL for heavy load, as Yahoo and Skype also do.
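
For a very rough single-session baseline (an upper bound only, since it ignores the concurrent access where the real contention shows up), you could also time the question's GetNewID proc in a tight loop; the iteration count below is arbitrary.

-- Crude single-session timing sketch; it does not model concurrent inserts.
DECLARE @i INT = 0, @id BIGINT, @start DATETIME2 = SYSUTCDATETIME();
WHILE @i < 10000
BEGIN
    EXEC dbo.GetNewID @ID = @id OUTPUT;   -- the proc from the question
    SET @i += 1;
END
SELECT 10000.0 * 1000
       / NULLIF(DATEDIFF(MILLISECOND, @start, SYSUTCDATETIME()), 0) AS ids_per_second;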

止于盛夏 2024-08-09 20:11:10

An idea that requires serious testing: try creating (inserting) new rows in batches -- say 1,000 (10,000? 1M?) at a time. You could have a master (aka bottleneck) table listing the next one to use, or you might have a query that does something like

 select min(id) from Users where (name = '')   -- table name "Users" is illustrative

Generate a fresh batch of empty rows in the morning, every hour, or whenever you're down to a certain number of free ones. This only addresses the issue of generating new IDs, but if that's the main bottleneck it might help.
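
A hedged sketch of that batching idea against the question's HelperIDTable (the proc name and block size are illustrative): one update reserves a whole range of IDs, so the hot row is touched once per block instead of once per row.

create proc GetNewIDBlock @BlockSize BIGINT, @FirstID BIGINT OUTPUT
as
begin
    -- reserves the range @FirstID .. @FirstID + @BlockSize - 1 in a single update
    update HelperIDTable set @FirstID = id, id = id + @BlockSize
end

The application then hands out the reserved range from memory and only comes back for another block when it runs out.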

A table partitioning option: Assuming a bigint ID column, how are you defining the partition? If you are allowing for 1G rows per day, you could set up the new partition in the evening (day1 = 1,000,000,000 through 1,999,999,999, day2 = 2,000,000,000 through 2,999,999,999, etc.) and then swap it in when it's ready. You are of course limited to 1000 partitions, so with bigints you'll run out of partitions before you run out of IDs.
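
A sketch of that partitioning scheme with made-up object names, using a RANGE RIGHT partition function on the bigint ID and splitting in a new boundary each evening:

-- Boundaries follow the 1G-rows-per-day example above.
CREATE PARTITION FUNCTION pf_UsersById (BIGINT)
    AS RANGE RIGHT FOR VALUES (1000000000, 2000000000, 3000000000);

CREATE PARTITION SCHEME ps_UsersById
    AS PARTITION pf_UsersById ALL TO ([PRIMARY]);

-- Each evening, add the boundary for the next day's ID range:
ALTER PARTITION SCHEME ps_UsersById NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_UsersById() SPLIT RANGE (4000000000);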
