I need a function that returns similar indexes for similar inputs
I was looking at hash functions and found that, given two similar strings, even if they differ by a single bit, the result is a completely different hash key. What I actually need is some sort of unique id that is quite similar for similar input (the input will be millions of alphanumeric strings).
Example:
- Two equal strings must have the same hash.
- Two different strings must have different hashes.
- Two different strings that are quite similar must have different hashes that are, at the same time, not too far from each other.
What would be a good approach to achieve that? I am using Python.
Comments (3)
What you're asking for is not possible, assuming that by 'similar hash' you mean the values should be of similar magnitude, e.g. 12345 is similar to 12346 but not to 92345. The reason is that similarity of that sort is one-dimensional (a number line), but the ways in which strings can be similar to each other have no fixed dimension (e.g. 'foo', 'fob' and 'fod' are all at distance 1 from each other).
If you want to perform fuzzy matching, you will instead need to use a different method of indexing your text, like this or this.
If you just want to compare individual values for similarity, don't hash them in the first place - just compute their edit distance immediately.
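For the direct comparison suggested above, a minimal sketch of the Levenshtein edit distance in plain Python might look like this. The function name `edit_distance` is only illustrative, and for millions of strings a dedicated library or index would likely be needed instead of this quadratic loop.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


# 'foo', 'fob' and 'fod' are pairwise at distance 1, which is why they
# cannot be placed on a number line with matching gaps.
print(edit_distance("foo", "fob"))  # 1
print(edit_distance("fob", "fod"))  # 1
print(edit_distance("foo", "fod"))  # 1
```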
If you're sure that you always have alphanumeric data, then I would recommend using a base 36 (or higher) algorithm.
You can use the method I gave as an answer to this question: Base 62 conversion
Example usage:
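A minimal sketch of the base 36 idea, assuming the strings contain only ASCII digits and letters. It relies on Python's built-in `int(s, 36)` and is not the code from the linked Base 62 answer; the name `base36_key` is hypothetical.

```python
def base36_key(s: str) -> int:
    """Interpret an ASCII alphanumeric string as a base 36 number."""
    if not (s.isascii() and s.isalnum()):
        raise ValueError("expected a non-empty ASCII alphanumeric string")
    # int() treats letters case-insensitively in base 36 (a/A = 10 ... z/Z = 35).
    return int(s, 36)


print(base36_key("abc"))  # 13368
print(base36_key("abd"))  # 13369, differs in the last character: close
print(base36_key("zbc"))  # 45768, differs in the first character: far
```

Under this scheme, strings that differ only in case ('abc' and 'ABC') or only by leading zeros ('0a' and 'a') map to the same key, and only differences near the end of the string produce nearby values.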
I believe the below satisfies your stated requirements.
Essentially, the hash value is the complete binary value of the UTF-8 encoded bytes of the input, read as a single integer. Similar character strings produce hash values with similar bits (not always with a small subtractive difference, but you did not specify that). Normalization causes the strings u'A\u030a' and u'\xc5' to have the same hash value. If you want to limit the maximum value, simply apply modulo division (by 2^32, maybe) as a final step.
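The description above could be sketched as follows in Python 3. This is a reconstruction under stated assumptions, not the answer's original code: the name `similar_hash`, the choice of NFC as the normalization form, and big-endian byte order are all assumptions.

```python
import unicodedata
from typing import Optional


def similar_hash(text: str, modulus: Optional[int] = None) -> int:
    """Return the UTF-8 bytes of the normalized text as one big integer."""
    # Assumption: NFC normalization, so composed and decomposed forms agree.
    normalized = unicodedata.normalize("NFC", text)
    # Assumption: big-endian, so the first character is the most significant.
    value = int.from_bytes(normalized.encode("utf-8"), "big")
    # Optional final step to cap the size of the result.
    return value if modulus is None else value % modulus


# 'A' + combining ring above and the precomposed U+00C5 hash identically.
print(similar_hash("A\u030a") == similar_hash("\xc5"))  # True

# 'abc' and 'abd' differ only in the low bits of the last byte.
print(hex(similar_hash("abc")), hex(similar_hash("abd")))  # 0x616263 0x616264
```

Because the integer keeps every byte of the input, its size grows with the string length; applying the modulus bounds it but allows collisions between strings that share the same trailing bytes.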