从姓名和地址数据创建 ID。 散列/摘要

发布于 2024-07-09 07:55:48 字数 763 浏览 10 评论 0原文

我的问题:

我正在寻找一种方法将一个人的姓名和地址表示为编码的 ID。 id 应仅包含字母数字字符、防冲突并以尽可能少的字符数表示。 我的第一个想法是简单地使用 MD5 或 SHA1 等加密哈希函数,但这似乎有点矫枉过正(安全性并不重要 - 不需要是单向的),而且我更愿意找到能够产生较短的 ID。 有谁知道适合这个问题的现有算法?

换句话说,实现以下函数的最佳方法是什么,以便对于相同的输入,返回值始终相同,不太可能发生冲突,并且 id 小于 20 个字符?

>>> make_fake_id(fname = 'Oscar', lname = 'Grouch', stnum = '1', stname = 'Sesame', zip = '12345')
N1743123734

应用程序上下文(对于感兴趣的人):

这将用于记录链接应用程序。 给定输入名称和地址,我们在一个非常大的数据库中搜索最佳匹配,并返回数据库 ID 和其他数据(我们如何做到这一点在这里并不重要)。 如果没有匹配项,我需要从搜索输入(实体的名称和地址数据)生成此伪/生成/派生 id。 每个搜索记录都应该产生一个输出记录,该记录具有真实的(匹配/链接产生的实际数据库 ID)或生成的伪/生成/派生 ID。 伪 ID 将带有一个字符(例如 N)作为前缀,以将其与真实 ID 区分开。

My problem:

I'm looking for a way to represent a person's name and address as an encoded id. The id should contain only alpha-numeric characters, be collision-proof, and be represented in a smallest number of characters possible. My first thought was to simply use a cryptographic hash function like MD5 or SHA1, but this seems like overkill (security isn't important - doesn't need to be one-way) and I'd prefer to find something that would produce a shorter id. Does anyone know of an existing algorithm that fits this problem?

In other words, what is the best way to implement the following function so that the return value is the same consistently for the same input, collisions are unlikely, and ids are less than 20 characters?

>>> make_fake_id(fname = 'Oscar', lname = 'Grouch', stnum = '1', stname = 'Sesame', zip = '12345')
N1743123734

Application Context (for those that are interested):

This will be used for a record linkage app. Given an input name and address we search a very large database for the best match and return the database id and other data (how we do this is not important here). If there isn't a match I need to generate this psuedo/generated/derived id from the search input (entity's name and address data). Every search record should result in an output record with either a real (the actual database id resulting from a match/link) or this generated psuedo/generated/derived id. The psuedo id will be prefixed with a character (e.g. N) to differentiate it from a real id.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

北陌 2024-07-16 07:55:48

我知道您拒绝 MD5 和 SHA1,但我认为您无论如何都应该考虑它们。 除了经过充分研究的哈希算法之外,长度还可以为您提供更多保护,防止可能的冲突。 没有哈希值是防冲突的,但加密哈希值通常比您自己想出的哈希值更不容易发生冲突。

I know you said no to MD5 and SHA1, but I think you should consider them anyway. As well as being well studied hashing algorithms, the length gives you more protection against possible collisions. No hash is collision-proof, but the cryptographic ones generally are less collision-prone than something you couuld come up with yourself.

红尘作伴 2024-07-16 07:55:48
  • 使用加密哈希来实现其抗冲突性,而不是其其他特性
  • 使用尽可能多的字节根据需要将哈希值(截断)
  • 转换为字母数字字符
  • 您还可以截断字母数字字符串而不是哈希值

一种简单的方法:对数据进行哈希处理,以 Base64 进行编码,删除所有非字母数字字符, 截断。

N_HASH_CHARS = 11
import hashlib, re
def digest(name, address):
  hash = hashlib.md5(name + "|" + address).digest().encode("base64")
  alnum_hash = re.sub(r'[^a-zA-Z0-9]', "", hash)
  return alnum_hash[:N_HASH_CHARS]

您应该保留多少个字母数字字符? 每个字符提供大约 5.95 位的熵 (log(62,2))。 11 个字符为您提供 65.5 位熵,这应该足以避免前 2**32.7 个用户(约 70 亿)的冲突。

  • Use a cryptographic hash for its collision resistance, not its other qualities
  • Use as many bytes from the hash as you want (truncate)
  • convert to alpha-numeric characters
  • You can also truncate the alpha-numeric string instead of the hash

An easy way to do this: hash the data, encode in base64, remove all non-alpha-numeric characters, truncate.

N_HASH_CHARS = 11
import hashlib, re
def digest(name, address):
  hash = hashlib.md5(name + "|" + address).digest().encode("base64")
  alnum_hash = re.sub(r'[^a-zA-Z0-9]', "", hash)
  return alnum_hash[:N_HASH_CHARS]

How many alpha-numeric characters should you keep? Each character gives you around 5.95 bits of entropy (log(62,2)). 11 characters give you 65.5 bits of entropy, which should be enough to avoid a collision for the first 2**32.7 users (about 7 billion).

仙女 2024-07-16 07:55:48

一个好的解决方案在某种程度上取决于您的应用程序。 您知道有多少用户以及所有用户的集合是什么吗? 如果您提供更多详细信息,您将获得更好的帮助。

A good solution is somewhat dependent on your application. Do you know how many users and what the set of all users is? If you provide more details you would get better help.

国际总奸 2024-07-16 07:55:48

我同意另一张海报建议的序列号。 OTOH,如果您真的真的想做其他事情:

根据数据创建 SHA1 哈希,并将其存储在带有序列号字段的表中。

然后,当你获取数据时,计算哈希值,在表上查找它,获取序列号,这就是你的 id。 如果桌子上没有,请将其插入。

I agree with the other poster suggesting serial numbers. OTOH, if you really, really really want to do something else:

Create a SHA1 hash from the data, and store it in a table with a serial number field.

Then, when you get the data, calculate the hash, look it up on the table, get the serial, and that's your id. If it's not on the table, insert it.

ゞ花落谁相伴 2024-07-16 07:55:48

我想知道你是否打算将这些ID“分配”给用户? 如果是这样,我希望你的用户会讨厌你提出的任何建议; 谁想要用户 ID“AAAAA01”?

因此,如果这些 id 对用户可见,那么您应该让他们选择他们喜欢的内容并检查它们的唯一性(简单)。 如果它们对用户不可见(例如,内部主键),则只需使用适当的技术(例如 Oracle 序列或 SQL Server 自动编号(也很简单))按顺序生成它们。

如果这些 id 试图检测多次注册的用户,那么我同意您应该考虑加密哈希,然后对注册数据(名称、地址等)进行完整比较。 但是,为了可用,您需要在计算哈希或进行比较之前将数据转换为规范形式(标准化字母大小写、空格、规范街道地址等)。 否则,您将因微小的差异而导致不匹配。

编辑:现在我根据您的编辑更好地了解了问题空间,我认为您的算法(到目前为止)不太可能捕获大多数匹配项。 除了我对输入进行规范化的建议之外,我建议您考虑一种方法,该方法会产生少数可能匹配的排名列表(如果可能,由人类解决),而不是对单个匹配进行全有或全无的尝试。 换句话说,我推荐使用搜索方法而不是查找方法。

这在你的情况下可行吗?

I wonder whether you intend to "assign" these ids to the users? If so, I would expect your users to hate anything that you propose; who would want a user id of "AAAAA01"?

So, if these ids are visible to the user, then you should just let them pick what they like and check them for uniqueness (easy). If they are not visible to the user (e.g., internal primary key), then just generate them sequentially using an appropriate technique such as an Oracle Sequence or SQL Server AutoNumber (also easy).

If these ids are an attempt to detect a user that is registering more than once, then I would agree that you should consider a cryptographic hash followed by a full comparison of the registration data (name, address, etc.). However, to be usable, you will need to translate the data into a canonical form (standardized letter case, whitespace, canonical street address, etc.) before computing the hash or making the comparison. Otherwise, you will mismatch based on trivial differences.

EDIT: Now that I understand the problem space better based on your edits, I think that it is highly unlikely that your algorithm (so far) will catch most matches. Beyond my suggestion to canonicalize the inputs, I recommend that you consider an approach that results in a ranked list of a handful of possible matches (to be resolved by a human if possible) rather than an all-or-nothing attempt at a single match. In other words, I recommend a search approach rather than a lookup approach.

Is that feasible in your situation?

听风吹 2024-07-16 07:55:48

好吧,如果同一地址有多个同名的人,那么您就完蛋了(无需添加代码来检测这一点并添加某种鉴别器)。

但假设问题不存在,那么完整地址的街道地址和邮政编码部分足以保证那里的唯一性,因此从名称中添加足够的数据应该可以解决这个问题...

您是否有权访问数据库,或其他持久性机制,您可以在其中生成和维护每个地址的键值? 然后将地址和各个实体保存在两个键控字典结构中,其中为每个新的不同地址、遇到的人自动生成密钥...然后使用自动生成的字母数字密钥...

You could use AAAAA01  for first person at first address,
              AAAAA02 for second person at first address,
              AAAAB07 for the seventh resident at the second adresss, etc.

如果您没有任何方法来生成和维护这些实体键映射,那么您需要使用完整的街道地址/邮政编码和完整名称,或相同的哈希值,尽管哈希值方法生成重复项的机会很小......

Well, if there's more than one person at the same address with the same name, you're toast here, (w/o adding code to detect this and add a discriminator of some kind).

but assuming that issue is not, then the street address and zip code portion of the full addresss is sufficient to guaranteee uniqueness there, so adding enough data from the name should take care of the issue...

Do you have access to a database, or other persistence mechanism, where you could generate and maintain key values for each address? Then keep the address and individual entities in two keyed dictionary structures, where the key is autogenerated for each new distinct address, person encountered... and then use the autogenerated alpha-numeric key...

You could use AAAAA01  for first person at first address,
              AAAAA02 for second person at first address,
              AAAAB07 for the seventh resident at the second adresss, etc.

If you donlt have any way to generate and maintain these entity-Key mappings then you need to use the full street address/Zip and fullNAme, or a hash value of the same, although the Hash value approach has a smnall chance of generating duplicates...

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文