用于处理短链接冲突的最快 C# 实现

发布于 2024-10-06 04:22:24 字数 1228 浏览 0 评论 0原文

我正在为我正在构建的缩短链接的服务预先生成短代码。它不是您常用的链接缩短器,因此我无法使用现成的链接缩短器,因为我们需要每秒处理近 1000 次缩短。

我有一项服务每 12 小时运行一次,向查找表中添加另外 200,000 个短链接,以快速生成链接。

随着短链接表变得越来越长,服务所需的时间也越来越长,以至于我们无法跟上所请求的短链接的需求。

缩短链接表有 180 万行。在用完之前,我们还剩下 28 万个链接。现在生成20万个链接需要1个多小时。

我显然做错了什么,可能是因为我只使用 List 进行比较。下面是代码块:

Stopwatch sw = Stopwatch.StartNew();
LtsDataContext ldc = new LtsDataContext();
List<string> currentCodes = ldc.ShortUrls.Select(s => s.ShortCode).ToList();
currentCodes = 
    currentCodes.Union(ldc.FastShortCodes.Select(s => s.ShortCode)).ToList();

int count = args.Length > 0 ? int.Parse(args[0]) : 200000;

List<string> newCodes = new List<string>(count);

for (int i = 0; i < count; i++)
{
    string newCode = Guid.NewGuid().ToString("N").Substring(0, 8);
    while (currentCodes.Contains(newCode) || newCodes.Contains(newCode))
        newCode = Guid.NewGuid().ToString("N").Substring(0, 8);
    newCodes.Add(newCode);
}

ldc.FastShortCodes.InsertAllOnSubmit(newCodes.Select(s => 
    new FastShortCode() { ShortCode = s }));
ldc.SubmitChanges();
Console.Write((decimal)sw.ElapsedMilliseconds / (decimal)1000);
Console.ReadKey();

I am pre-generating short codes for a service I am building that shortens links. It isn't your usual link shortener, so I can't use off the shelf, as we need to handle close to 1000 shortens per second.

I have a service that runs every 12 hours to add another 200,000 shortlinks to the lookup table for fast generating the links.

As the table of shortlinks gets longer, the service takes longer and longer, to a point where we can't keep up with the demand for shortlinks being requested.

The table of shortened links is 1.8M rows. We have 280k links left before we run out. And it is taking more than 1 hour to generate 200k links now.

I am obviously doing something wrong, probably the fact that I am using just a List<string> to compare against. Below is the block of code:

Stopwatch sw = Stopwatch.StartNew();
LtsDataContext ldc = new LtsDataContext();
List<string> currentCodes = ldc.ShortUrls.Select(s => s.ShortCode).ToList();
currentCodes = 
    currentCodes.Union(ldc.FastShortCodes.Select(s => s.ShortCode)).ToList();

int count = args.Length > 0 ? int.Parse(args[0]) : 200000;

List<string> newCodes = new List<string>(count);

for (int i = 0; i < count; i++)
{
    string newCode = Guid.NewGuid().ToString("N").Substring(0, 8);
    while (currentCodes.Contains(newCode) || newCodes.Contains(newCode))
        newCode = Guid.NewGuid().ToString("N").Substring(0, 8);
    newCodes.Add(newCode);
}

ldc.FastShortCodes.InsertAllOnSubmit(newCodes.Select(s => 
    new FastShortCode() { ShortCode = s }));
ldc.SubmitChanges();
Console.Write((decimal)sw.ElapsedMilliseconds / (decimal)1000);
Console.ReadKey();

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

初心 2024-10-13 04:22:24

您的问题是,通过使用 GUID 片段,您必须检查与越来越大的表的冲突。

我假设您不想创建顺序键,因为它很容易受到 URL 探索的影响,但是您应该从顺序的东西开始,然后对其进行混淆。

编辑

根据评论者的建议,我会:

1. take a sequential key
2. shift left 8
3. add a random value between 0 and 255
4. encode base-62 (0-9,A-Z,a-z)

您将不会发生冲突,并且随机位意味着随机尝试 URL 的人每 255 次尝试中只会获得一次命中。

Your problem is that by using GUID fragments, you have to check for collisions against a larger and larger table.

I assume you don't want to create sequential keys because it's vulnerable to URL spelunking, but then you should start with something sequential and then obfuscate it.

Edit

Going with a commenter's suggestion, I would:

1. take a sequential key
2. shift left 8
3. add a random value between 0 and 255
4. encode base-62 (0-9,A-Z,a-z)

You will have no collisions, and the random bits will mean a person randomly trying URLs will only get one hit out of every 255 attempts.

坏尐絯℡ 2024-10-13 04:22:24

如果我错了,请纠正我,但看起来您正在将整个短链接表(180 万)加载到列表中,然后使用 Contains 函数搜索它,正如 @Jeff Foster 指出的那样,这是一个 O(n ) 手术。

为什么不使用更乐观的方法呢?在数据库中,向 ShortUrls/FastShortCodes 表的 ShortCode 列添加唯一约束。然后简单地生成新的短代码并尝试插入它们。如果它们未满足唯一约束(这种情况不应经常发生),则只需捕获异常并重试即可。

我也同意 @egruin 的观点,即通过使用 GUID 的子字符串,您将自己限制为只有 15 个字符(0-9,AF)。我会寻找一种至少使用所有字母数字字符(0-9,AZ)的方法,这将显着减少您遇到的冲突次数。

Correct me if I'm wrong, but it looks like you're loading your entire table of short links (1.8 million) into a List and then searching it with a Contains function, which as @Jeff Foster pointed out is an O(n) operation.

Why not use a more optimistic method? In your database, add a unique constraint to the ShortCode column of the ShortUrls/FastShortCodes table. Then simply generate new short codes and attempt to insert them. If they fail the unique constraint (which shouldn't happen too often) then just catch the exception and try again.

I also agree with @egruin that by using a substring of a GUID you are limiting yourself to only fifteen characters (0-9, A-F). I would look for a way to use at least all alphanumeric characters (0-9, A-Z) which would significantly reduce the number of collisions you encounter.

财迷小姐 2024-10-13 04:22:24

为什么不对 currentCodesnewCodes 使用 DictionaryListContains 方法需要遍历所有列表条目 O(n),而 Dictionary 的运行时间复杂度为 O(1 )。

编辑1
如果您无论如何都将所有链接保存在数据库中,为什么还需要 Guid?为什么不在链接中简单地使用数据库的主键呢?

编辑2
由于代码已经存在的可能性非常低,您可以尝试插入并捕获异常,然后再次尝试(乐观插入)。

Why don't you use a Dictionary for currentCodes and newCodes? The Contains method of a List needs to traverse all list entries O(n), in contrast to a Dictionary, which runs in O(1).

EDIT1
IF you are saving all your links in the database anyway, why do you need a Guid? Why don't you simply use the primary key of the database in your links?

EDIT2
Since the likelyhood that a code already exists is very low you could try an insert and catch the exception and then try it again (optimistic inserting).

木有鱼丸 2024-10-13 04:22:24

这是您在这一行中的问题:

currentCodes.Contains(newCode) || newCodes.Contains(newCode)

随着描述问题的每次添加,这将需要越来越多的内容。我怀疑您是否对此的数学复杂性感兴趣,但这将接近 (N*N)/2。

解决方案是使用HashTable

Here is your problem in this line:

currentCodes.Contains(newCode) || newCodes.Contains(newCode)

This will take more and more with each addition which describes the problem. I doubt if you are interested in mathematical complexity of this, but this would be close to (N*N)/2.

Solution is to use HashTable.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文