There is a trivial problem:
- assign a uniqueidentifier to any externalId
- do not overwrite the uniqueidentifier once it is assigned - just return the existing uniqueidentifier
Imagine a table
ExternalId | Guid
--------------------------------
some1 | accf-0334-dfdf-....
Now, the twist is the scale. We want billions of externalIds to be mapped like this, and we need to be able to assign these identifiers fast (thousands/sec).
We started off with a simple SQL Server table, but it was not performing well. We moved the same schema to a Cassandra ColumnFamily - the writes are super fast and it's sharded, but: before writing we have to read (to make sure the externalId is not assigned already), so we hit the read seek I/O limit again.
Hashing (to determine the uniqueidentifier) is unfortunately not possible, as we already have hundreds of millions assigned. Caching is problematic because in most cases we are assigning a 'brand new externalId', so it wouldn't be in the database at all.
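For concreteness, each assignment today boils down to a read followed by a conditional write, roughly like this (illustrative SQL only - the table and parameter names are made up, and the Cassandra version has the same shape):

    -- Step 1: read to see whether this externalId already has a Guid assigned
    SELECT Guid FROM ExternalIdMap WHERE ExternalId = @externalId;

    -- Step 2: only if nothing came back, write a brand new mapping
    INSERT INTO ExternalIdMap (ExternalId, Guid) VALUES (@externalId, NEWID());

It is that first read, a lookup that usually finds nothing, that runs into the seek I/O limit.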
Does anybody have any suggestions for the solution here?
Comments (2)
Use SQL Server, and create your table with IGNORE_DUP_KEY = ON on its clustered primary key.
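A minimal sketch of such a table (the names, varchar length and NEWID() default are just illustrative; the essential parts are the clustered primary key on ExternalId and IGNORE_DUP_KEY = ON):

    -- Sketch only: identifiers and types are illustrative
    CREATE TABLE ExternalIdMap (
        ExternalId varchar(100)     NOT NULL,
        Guid       uniqueidentifier NOT NULL DEFAULT NEWID(),
        CONSTRAINT PK_ExternalIdMap PRIMARY KEY CLUSTERED (ExternalId)
            WITH (IGNORE_DUP_KEY = ON)
    );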
With this scenario, you always do the same two (super-fast) operations (sketched below):
1 - Insert the ExternalID
2 - Query the GUID for the ExternalID
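Roughly, against the sketch table above (again, names are illustrative):

    -- 1: attempt the insert; the Guid column fills itself via its default, and with
    --    IGNORE_DUP_KEY = ON an existing ExternalId is silently skipped rather than erroring
    INSERT INTO ExternalIdMap (ExternalId) VALUES (@externalId);

    -- 2: read back whichever Guid is now stored for that ExternalId
    SELECT Guid FROM ExternalIdMap WHERE ExternalId = @externalId;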
It won't allow duplicates, but they won't throw an error. It will also be a highly optimized seek because of the clustered index.
You will need to rebuild the index frequently because you will get a high degree of fragmentation over time (since you are clustering on a non-incremental varchar) but it should meet your other requirements.
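For completeness, that periodic maintenance can be as simple as this (assuming the constraint and table names from the sketch above):

    -- Rebuild the clustered index to undo the fragmentation from random-order inserts
    ALTER INDEX PK_ExternalIdMap ON ExternalIdMap REBUILD;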
This is an interesting problem, and honestly, beyond the scope of my knowledge. However, I found it interesting, and stumbled across this link - http://blogs.msdn.com/b/miah/archive/2008/02/17/sql-if-exists-update-else-insert.aspx
It seems like this method skips the initial full table scan, which should increase performance. This isn't the best answer, but perhaps you can customize the general idea to use for your specific DB implementation. (I've never heard of Cassandra, so it looks like I need to do some research.) Even if it doesn't, it might give you some ideas (I hope). Don't know if you have tried something like this already, but I thought I'd share the link. I wish you the best of luck.
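For context, the pattern that post is usually summarized as looks roughly like this, adapted to the mapping table sketched above (illustrative names; since the Guid must never be overwritten, the UPDATE is only an existence probe here):

    -- Touch the row first; @@ROWCOUNT then tells us whether it already existed,
    -- without a separate EXISTS lookup
    UPDATE ExternalIdMap
       SET ExternalId = ExternalId          -- deliberate no-op: nothing may be overwritten
     WHERE ExternalId = @externalId;

    IF @@ROWCOUNT = 0
        INSERT INTO ExternalIdMap (ExternalId, Guid)
        VALUES (@externalId, NEWID());

    SELECT Guid FROM ExternalIdMap WHERE ExternalId = @externalId;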