Trivial task - complex solution?

Posted on 2024-12-12 02:44:12

There is a trivial problem:

  • assign uniqueidentifier to any externalId
  • do not overwrite the uniqueidentifier once it is assigned - just return existing uniqueidentifier

Imagine a table

  ExternalId | Guid
--------------------------------
   some1     | accf-0334-dfdf-....

Now, the twist is the scale. We want billions of externalIds to be mapped like this, and we need to be able to assign these identifiers fast (thousands per second).

We started off with a simple SQL Server table, but it was not performing well. We moved the same schema to a Cassandra ColumnFamily. The writes are super fast and it's sharded, but before writing we have to read (to make sure the externalId is not already assigned), so we hit the read seek I/O limit again.

Hashing the externalId to derive the uniqueidentifier is unfortunately not an option, because hundreds of millions have already been assigned. Caching is problematic because in most cases we are assigning a brand new externalId, so it wouldn't be in the database at all.
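To make the hashing point concrete, here is a sketch of what a hash-derived identifier could look like, assuming a name-based UUID (version 5). The question does not say which hashing scheme was considered, so `uuid.uuid5`, the `NAMESPACE` value, and the helper name `hashed_guid` are all assumptions for illustration.

```python
import uuid

# Hypothetical fixed namespace; any constant UUID would serve.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def hashed_guid(external_id: str) -> uuid.UUID:
    """Derive a GUID deterministically from the externalId, with no lookup."""
    return uuid.uuid5(NAMESPACE, external_id)


# Stands in for a GUID that was assigned randomly before hashing was considered.
legacy_guid = uuid.uuid4()

# The derived value is stable across calls, so no read-before-write is needed...
stable = hashed_guid("some1") == hashed_guid("some1")
# ...but it cannot reproduce an identifier that was already handed out at random.
matches_legacy = hashed_guid("some1") == legacy_guid
```

A scheme like this removes the read entirely for new data, but the identifiers already assigned would still need a lookup, which is the obstacle described above.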

Does anybody have any suggestions for the solution here?

Comments (2)

世态炎凉 2024-12-19 02:44:12

Use SQL Server, and create your table like so:

  • A UNIQUE clustered index on ExternalID
  • A default value of NEWID() on the Guid column
  • Make sure the unique clustered index on ExternalID has IGNORE_DUP_KEY = ON

With this scenario, you always do the same two (super-fast) operations:

1 - Insert the ExternalID
2 - Query the GUID for the ExternalID

The index won't allow duplicates, but a duplicate insert won't throw an error either; the row is simply skipped. The lookup will also be a highly optimized seek because of the clustered index.

You will need to rebuild the index frequently because you will get a high degree of fragmentation over time (since you are clustering on a non-incremental varchar) but it should meet your other requirements.
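The insert-then-query pattern can be tried out locally with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in, not SQL Server: SQLite's `INSERT OR IGNORE` plays the role of `IGNORE_DUP_KEY = ON`, a `WITHOUT ROWID` primary key plays the role of the clustered index, and the GUID is generated client-side because SQLite has no `NEWID()` default. The table name `id_map` and the function name are hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# WITHOUT ROWID stores rows in primary-key order, roughly like a clustered index.
conn.execute(
    "CREATE TABLE id_map ("
    " external_id TEXT PRIMARY KEY,"
    " guid TEXT NOT NULL"
    ") WITHOUT ROWID"
)


def get_or_assign(conn: sqlite3.Connection, external_id: str) -> str:
    """Always run the same two statements: blind insert, then keyed lookup."""
    # Step 1: insert; an existing key makes this a silent no-op.
    conn.execute(
        "INSERT OR IGNORE INTO id_map (external_id, guid) VALUES (?, ?)",
        (external_id, str(uuid.uuid4())),
    )
    conn.commit()
    # Step 2: read back whichever GUID is stored for this key.
    row = conn.execute(
        "SELECT guid FROM id_map WHERE external_id = ?", (external_id,)
    ).fetchone()
    return row[0]


a = get_or_assign(conn, "some1")
b = get_or_assign(conn, "some1")  # second call leaves the stored GUID untouched
```

The point of the design is that the existence check happens inside the index maintenance of the insert itself, so the application never issues a separate read before writing.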

草莓味的萝莉 2024-12-19 02:44:12

This is an interesting problem, and honestly, beyond the scope of my knowledge. However, I found it interesting, and stumbled across this link - http://blogs.msdn.com/b/miah/archive/2008/02/17/sql-if-exists-update-else-insert.aspx

It seems like this method skips the initial full table scan, which should increase performance. This isn't the best answer, but perhaps you can customize the general idea to use for your specific DB implementation. (I've never heard of Cassandra, so it looks like I need to do some research.) Even if it doesn't, it might give you some ideas (I hope). Don't know if you have tried something like this already, but I thought I'd share the link. I wish you the best of luck.
