Shortening / rehashing UUIDs

Posted 2024-08-22 02:54:01

First of all, I'm aware that rehashing is a sensitive topic. Still, I'd like to hear your opinions on which approach you would take here.

I'm building a distributed application, where nodes remotely create entities identified by a UUID. Eventually, all entities should be gathered at a dedicated drain node, which stores all entities by using these UUIDs.

Now I want to create additional identifiers that are handier for human users. Base64-encoding a UUID still yields a 22-character ID, which is not suitable for human use. So I need something like a URL-shortening service. Applying a bijective function will not help, because it does not reduce the amount of information. Of course, I'm aware that I need to lose information in order to shorten the ID, and that any reduction of a hash's information increases the probability of collisions.
I'm stuck on what the most appropriate way is to reduce the information in order to create shorter IDs for humans.

Here are some prerequisites: I will provide the ability to map {UUID, shortened ID} via my data storage. I'd still prefer a non-centralized solution. I will probably never need more than about a million IDs (~2^20) in total.
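To put numbers on the collision risk for roughly 2^20 IDs, the usual birthday approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) can be evaluated for a few candidate bit lengths b. This is only a rough sketch: it assumes the shortened IDs are uniformly distributed, and the class and method names are made up for illustration.

```java
final class CollisionEstimate {

    /**
     * Approximate probability of at least one collision among n
     * uniformly random IDs of the given bit length (birthday bound).
     */
    static double collisionProbability(long n, int bits) {
        double space = Math.pow(2.0, bits);
        double exponent = -((double) n * (n - 1)) / (2.0 * space);
        // -expm1(x) equals 1 - e^x and stays accurate for very small x.
        return -Math.expm1(exponent);
    }

    public static void main(String[] args) {
        long n = 1L << 20; // about one million IDs, as stated in the question
        for (int bits : new int[] {32, 40, 48, 64}) {
            System.out.printf("%d bits: %.8f%n", bits, collisionProbability(n, bits));
        }
    }
}
```

Under these assumptions, a 32-bit short ID is practically certain to collide somewhere among 2^20 IDs, 40 bits gives roughly a 39% chance of at least one collision, and 48 bits about 0.2%. So if collisions are not detected and resolved through the stored mapping, the short ID cannot be made very short.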

Here are the thoughts I came up with so far:

  • Auto-incremented IDs: If I used some kind of auto-incremented ID, I could convert it to an obfuscated string and pass that around. This would be the easiest approach, and as long as there are few keys around, the keys would not be very long. However, I'd have to introduce a centralized entity, which I don't really want.
  • Shorten the UUID: I could just take some of the bits of the original 128-bit UUID. In that case I should at least take the version of the UUID into account. Or is there anything else wrong with this?
  • Rehashing the UUID: I could apply a second hashing algorithm to my initial UUID and store the mapping.

Are there any other approaches? Which one is preferable?

Thanks in advance!

Comments (4)

帅哥哥的热头脑 2024-08-29 02:54:01

1) To shorten the UUID, you can simply XOR the top half with the bottom half (and repeat until it's short enough for you). This preserves the distribution characteristics. Like any solution that shortens the output, it increases the possibility of collisions due to the birthday paradox.

2) XOR amounts to a trivial hash, but since no additional mixing is needed, it's fine. You could use a CRC or noncryptographic hash on your UUID, but I don't believe it's any improvement.

3) If you're willing to accept some central management, it doesn't have to be painful. A central authority can dole out medium-sized blocks of address space to each client, and the client can then iterate through that subrange when assigning IDs. This guarantees that there are no collisions, and it also avoids a round-trip for each ID. One way to do it would be to use a 32-bit integer for the ID, doling out a 16-bit block at a time. In other words, the first client gets handed the prefix 0001 (hex), which allows the IDs 00010000 to 0001FFFF.

4) You could insert into the database with a UUID, but also have an identity field. This would provide an alternate, more compact unique ID, which can be limited to a 32-bit int.
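The XOR folding from points 1 and 2 might look roughly like the sketch below. It relies only on `java.util.UUID`; the class name `UuidFold`, its method names, and the base-36 rendering are assumptions made for illustration and are not part of the answer.

```java
import java.util.UUID;

final class UuidFold {

    /** Folds the 128-bit UUID to 64 bits by XORing its two halves. */
    static long foldTo64(UUID id) {
        return id.getMostSignificantBits() ^ id.getLeastSignificantBits();
    }

    /** Folds once more, from 64 bits down to 32 bits. */
    static int foldTo32(UUID id) {
        long h = foldTo64(id);
        return (int) (h ^ (h >>> 32));
    }

    /** Renders the 32-bit fold in base 36 (digits and lowercase letters), at most 7 characters. */
    static String shortId(UUID id) {
        return Long.toString(Integer.toUnsignedLong(foldTo32(id)), 36);
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        System.out.println(id + " -> " + shortId(id));
    }
}
```

With around a million IDs, a 32-bit value is very likely to collide somewhere, so a scheme like this only works if the {UUID, shortened ID} mapping detects collisions and resolves them, or if the fold is stopped at a larger width.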
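The block allocation described in point 3 can be sketched as follows. In a real deployment the authority would be a remote service; here it is an in-process object so that the example stays self-contained. The names `BlockIds`, `Authority`, and `Client` are hypothetical, as is the choice to fetch a fresh block automatically once the current one is used up.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class BlockIds {

    /** Central authority: hands out 16-bit prefixes, one block per request. */
    static final class Authority {
        private final AtomicInteger nextPrefix = new AtomicInteger(1);

        int allocatePrefix() {
            int prefix = nextPrefix.getAndIncrement();
            if (prefix > 0xFFFF) {
                throw new IllegalStateException("prefix space exhausted");
            }
            return prefix;
        }
    }

    /** Client: assigns IDs locally inside its block, without a round-trip per ID. */
    static final class Client {
        private final Authority authority;
        private int prefix;
        private int next = 0; // offset inside the current block, 0..0xFFFF

        Client(Authority authority) {
            this.authority = authority;
            this.prefix = authority.allocatePrefix();
        }

        /** Returns the next ID as an unsigned 32-bit value held in a long. */
        synchronized long nextId() {
            if (next > 0xFFFF) { // block used up: ask for a new one
                prefix = authority.allocatePrefix();
                next = 0;
            }
            return ((long) prefix << 16) | next++;
        }
    }
}
```

The first client created this way starts at 0x00010000 and the second at 0x00020000, which matches the ranges given in the answer. The only contact with the authority happens once per 65536 IDs.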

不语却知心 2024-08-29 02:54:01

Have you considered an external aliasing approach, where you pick a dictionary of human-friendly terms and use them to make (parts of) the UUID more readable (compare with geocoding systems such as What3Words):

de305d54-75b4-431b-adb2-eb6b9e546013

With a dictionary of 65536 words, this could become:

de305d54-zebra-stackoverflow-extraneous-eb6b9e546013

It is unlikely that users will see mental hash collisions (zebra occurring twice) with these human-readable names, and your database does not grow in size. The translation is bijective and purely a UI concern.

There is even an RFC for this: https://datatracker.ietf.org/doc/html/rfc1751
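A real 65536-word dictionary is too large to show here, so the sketch below swaps in a different technique: it generates a pronounceable five-letter word for each 16-bit group from alternating consonants and vowels (4 + 2 + 4 + 2 + 4 bits). This is a stand-in for a dictionary lookup, not the scheme from RFC 1751, and the letter tables and the class name `WordAlias` are assumptions chosen for illustration. The mapping stays bijective, as in the answer.

```java
final class WordAlias {

    private static final String CONSONANTS = "bdfghjklmnprstvz"; // 16 symbols = 4 bits
    private static final String VOWELS = "aiou";                 // 4 symbols = 2 bits

    /** Encodes a 16-bit value as a consonant-vowel-consonant-vowel-consonant word. */
    static String encode16(int value) {
        int v = value & 0xFFFF;
        char[] out = new char[5];
        out[0] = CONSONANTS.charAt((v >>> 12) & 0xF);
        out[1] = VOWELS.charAt((v >>> 10) & 0x3);
        out[2] = CONSONANTS.charAt((v >>> 6) & 0xF);
        out[3] = VOWELS.charAt((v >>> 4) & 0x3);
        out[4] = CONSONANTS.charAt(v & 0xF);
        return new String(out);
    }

    /** Inverse of encode16. */
    static int decode16(String word) {
        if (word.length() != 5) {
            throw new IllegalArgumentException("expected 5 letters: " + word);
        }
        int v = 0;
        for (int i = 0; i < 5; i++) {
            boolean consonant = (i % 2 == 0);
            int index = (consonant ? CONSONANTS : VOWELS).indexOf(word.charAt(i));
            if (index < 0) {
                throw new IllegalArgumentException("unexpected letter: " + word.charAt(i));
            }
            v = (v << (consonant ? 4 : 2)) | index;
        }
        return v;
    }

    /** Replaces the three middle groups of a canonical UUID string with words. */
    static String alias(String uuid) {
        String[] parts = uuid.split("-");
        if (parts.length != 5) {
            throw new IllegalArgumentException("not a canonical UUID: " + uuid);
        }
        for (int i = 1; i <= 3; i++) {
            parts[i] = encode16(Integer.parseInt(parts[i], 16));
        }
        return String.join("-", parts);
    }
}
```

Because every 16-bit value maps to exactly one word and back, nothing has to be stored: the alias can be computed in the UI layer and decoded again when a user types it in.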

如梦亦如幻 2024-08-29 02:54:01

Just a couple of things that pop into mind:

What is your use case? If your concern is that you will be generating IDs in a distributed manner, one solution is to assign each machine its own unique int ID and use that as a prefix or suffix on its IDs.

This doesn't really help if, by not having a central entity, you mean that nothing keeps track of IDs even locally. You could borrow a page from UUID itself and use the system time in conjunction with the machine ID assigned as above. This would get you down to 64 bits plus whatever size your machine ID is. Basically, this is the UUID v1 scheme, except that you use something shorter than a MAC address for the machine ID. Given that you know you can start at dates >= Feb 12, 2010, you may be able to shorten it even further.
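A minimal sketch of this time-plus-machine-ID idea is shown below, using a custom epoch of Feb 12, 2010 as mentioned above. The bit widths (10 bits of machine ID, 12 bits of per-millisecond sequence) and the sequence counter itself are assumptions added so that two IDs created in the same millisecond on one machine still differ; the class name `TimeMachineId` is hypothetical.

```java
import java.time.Instant;

final class TimeMachineId {

    // Custom epoch: timestamps are counted from Feb 12, 2010 (UTC).
    static final long EPOCH_MILLIS = Instant.parse("2010-02-12T00:00:00Z").toEpochMilli();
    static final int MACHINE_BITS = 10;  // assumed: up to 1024 machines
    static final int SEQUENCE_BITS = 12; // assumed: up to 4096 IDs per millisecond

    private final long machineId;
    private long lastMillis = -1;
    private long sequence = 0;

    TimeMachineId(long machineId) {
        if (machineId < 0 || machineId >= (1L << MACHINE_BITS)) {
            throw new IllegalArgumentException("machineId out of range: " + machineId);
        }
        this.machineId = machineId;
    }

    synchronized long nextId() {
        return nextId(System.currentTimeMillis());
    }

    /** Same as nextId(), with the clock passed in explicitly so it can be tested. */
    synchronized long nextId(long nowMillis) {
        if (nowMillis < lastMillis) {
            nowMillis = lastMillis; // clock stepped back: keep counting from the last value
        }
        if (nowMillis == lastMillis) {
            sequence++;
            if (sequence >= (1L << SEQUENCE_BITS)) { // sequence exhausted: borrow the next millisecond
                nowMillis++;
                sequence = 0;
            }
        } else {
            sequence = 0;
        }
        lastMillis = nowMillis;
        return ((nowMillis - EPOCH_MILLIS) << (MACHINE_BITS + SEQUENCE_BITS))
                | (machineId << SEQUENCE_BITS)
                | sequence;
    }
}
```

With these widths, 41 bits of milliseconds last roughly 69 years from the custom epoch, and the whole ID fits into a signed 64-bit long. Machine IDs still have to be assigned once, but no coordination is needed per ID.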

Check out the Wikipedia UUID entry if you haven't already; you may get an idea or two from there on how to construct your own.

維他命╮ 2024-08-29 02:54:01

Here is a simple hashing algorithm I wrote. You could use this... you can easily change the input and output mappings, and the length of the hash in order to trade off readability vs collision likelihood.

This algorithm is not designed to be secure or particularly efficient, but it should do the trick.

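public class HashTools {

  // Hex digits accepted as input; any other character (such as '-') is skipped.
  final static String inputMapping = "0123456789ABCDEF";

  // Output alphabet: 0-9 and A-Z (36 symbols).
  final static String[] outputMapping = new String[] {
      "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "A", "B", "C", "D", "E", "F", "G", "H",
      "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"
  };

  /* Input: String - a UUID or other hex string (case-insensitive)
   * Output: <hashLength> String using 0-9,A-Z encoding
   */
  public static String simpleHash(String str, int hashLength) {
    if (hashLength <= 0) {
      throw new IllegalArgumentException("hashLength must be positive");
    }
    StringBuilder hashStr = new StringBuilder(hashLength);
    String strUpper = str.toUpperCase();
    int[] hash = new int[hashLength];

    int i, j, num;
    for (i = 0; i < strUpper.length(); i++) {
      char strChar = strUpper.charAt(i);
      num = mapCharToInt(strChar);
      if (num < 0) {
        // Not a hex digit: skip it, otherwise the -1 from indexOf could
        // drive a bucket negative and break the output lookup below.
        continue;
      }

      // Add the digit's value to the bucket selected by its position.
      j = i % hashLength;
      hash[j] += num;
    }

    // Map each bucket sum to one character of the output alphabet.
    for (i = 0; i < hashLength; i++) {
      hashStr.append(mapIntToHashChar(hash[i]));
    }

    return hashStr.toString();
  }

  // Returns the value of a hex digit, or -1 if the character is not one.
  private static int mapCharToInt(char hexChar) {
    return inputMapping.indexOf(hexChar);
  }

  private static String mapIntToHashChar(int num) {
    return outputMapping[num % outputMapping.length];
  }
}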