
I need a function that returns similar indices for similar inputs

Posted 2024-12-09 11:38:59

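I have been looking into hash functions and found that, given two similar strings, even a small difference produces a completely different hash key. What I actually need is some kind of unique id with the property that similar inputs get very similar ids (the inputs will be millions of alphanumeric strings).

Examples:

Two equal strings must have the same hash value.
Two different strings must have different hash values.
Two different but very similar strings must have different hash values that are not too far from each other.

What would be a good way to achieve this? I'm using Python.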


万劫不复 2024-12-16 11:38:59

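What you're asking for is impossible, assuming that by "similar hashes" you mean the values should have similar magnitudes, e.g. 12345 is similar to 12346 but not to 92345. The reason is that this kind of similarity is one-dimensional (a number line), whereas the ways in which strings can be similar to each other have no fixed dimension. For example, "foo", "fob", and "fod" are all at distance 1 from each other, but no three distinct integers can all be at distance 1 from each other.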

If you want to perform fuzzy matching, you need to index the text using a different approach, such as this or this.
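One common way to index text for fuzzy matching is an n-gram inverted index. This is not necessarily what the answer linked to; it is a minimal sketch of the general idea, and the names `ngrams`, `NGramIndex`, and the `threshold` parameter are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of n-character substrings of text, padded with spaces."""
    padded = " " * (n - 1) + text + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


class NGramIndex:
    """Hypothetical inverted index from n-grams to the strings containing them."""

    def __init__(self, n=3):
        self.n = n
        self.strings = []                 # position in this list is the string's id
        self.postings = defaultdict(set)  # n-gram -> ids of strings containing it

    def add(self, text):
        string_id = len(self.strings)
        self.strings.append(text)
        for gram in ngrams(text, self.n):
            self.postings[gram].add(string_id)

    def search(self, query, threshold=0.4):
        """Return (similarity, string) pairs scoring at least threshold, best first."""
        query_grams = ngrams(query, self.n)
        # Only strings sharing at least one n-gram with the query are candidates.
        candidates = set()
        for gram in query_grams:
            candidates |= self.postings.get(gram, set())
        results = []
        for string_id in candidates:
            grams = ngrams(self.strings[string_id], self.n)
            # Jaccard similarity: shared n-grams over all distinct n-grams.
            score = len(query_grams & grams) / len(query_grams | grams)
            if score >= threshold:
                results.append((score, self.strings[string_id]))
        return sorted(results, reverse=True)


index = NGramIndex()
for s in ["ab12cd34", "ab12cd35", "zz99yy88"]:
    index.add(s)

# Finds the two near matches and skips the unrelated string.
print(index.search("ab12cd36"))
```

The lookup only touches strings that share at least one n-gram with the query, which is what makes this kind of index usable on millions of strings where comparing against every entry would be too slow.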

If you just want to compare individual values for similarity, don't hash them in the first place; just compute their edit distance directly.
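For reference, edit distance can be computed with the standard dynamic-programming recurrence. The sketch below assumes Levenshtein distance is the intended metric (the answer does not say which one), and the function name `edit_distance` is hypothetical. It also prints the distances for the "foo", "fob", "fod" example.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    # previous[j] is the distance between the prefix of a processed so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


words = ["foo", "fob", "fod"]
for i, x in enumerate(words):
    for y in words[i + 1:]:
        print(x, y, edit_distance(x, y))  # every pair is at distance 1
```

This runs in time proportional to `len(a) * len(b)` per pair, so it suits comparing individual values rather than scanning millions of strings.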


清秋悲枫 2024-12-16 11:38:59

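If you're sure that you always have alphanumeric data, then I would recommend using a base 36 (or higher) algorithm.

You can use the method I gave as an answer to this question: Base 62 conversion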

import string

# Alphabet for base 62: digits, then lowercase, then uppercase letters.
BASE_LIST = string.digits + string.ascii_letters
BASE_DICT = {c: i for i, c in enumerate(BASE_LIST)}

def base_decode(text, reverse_base=BASE_DICT):
    """Convert a string of base digits into an integer."""
    length = len(reverse_base)
    ret = 0
    for i, c in enumerate(text[::-1]):
        ret += (length ** i) * reverse_base[c]

    return ret

def base_encode(integer, base=BASE_LIST):
    """Convert a non-negative integer into a string of base digits."""
    if integer == 0:
        return base[0]

    length = len(base)
    ret = ''
    while integer != 0:
        ret = base[integer % length] + ret
        integer //= length

    return ret

Example usage:

for i in range(100):
    print(i, base_decode(base_encode(i)), base_encode(i))
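To see what this gives for similar strings, here is a small sketch that uses Python's built-in `int(text, 36)` as a stand-in for the base 62 decoder in this answer. That is an assumption made to keep the example self-contained, and the sample strings are made up.

```python
# int(text, 36) reads digits 0-9 and letters a-z (case-insensitive) as a base-36 number.
a = int("abc1", 36)
b = int("abc2", 36)
c = int("bbc1", 36)

# A change in the last character moves the value by 1.
print(b - a)             # → 1

# A change in the first character moves it by 36**3.
print(c - a)             # → 46656
print(c - a == 36 ** 3)  # → True
```

So the resulting numbers are close only when the strings differ near the end; a one-character difference at the start, or a different length, puts them far apart. Note also that base 36 parsing ignores case and leading zeros, so "ABC1" and "abc1" map to the same number.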


旧伤还要旧人安 2024-12-16 11:38:59

I believe the below satisfies your stated requirements.

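import unicodedata
from functools import reduce

def gethash(data):
    """Given a character string, return an integer hash value."""
    # Normalize to NFC, encode as UTF-8, then fold the bytes into one integer.
    encoded = unicodedata.normalize('NFC', data).encode('UTF-8')
    return reduce(lambda b1, b2: (b1 << 8) + b2, encoded, 0)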

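Essentially, the hash value is the complete binary value of the UTF-8 encoded bytes of the input, read as a single integer. Similar character strings produce hash values with similar bits (not always with a small subtractive difference, but you did not specify that). Normalization causes the strings u'A\u030a' and u'\xc5' to have the same hash value.

If you want to limit the maximum value, then simply apply modulo division (by 2^32, maybe) as a final step.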
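The claims above can be checked with a short sketch. It assumes Python 3 and uses `int.from_bytes`, which produces the same big-endian integer as the byte-folding approach; the name `gethash_bytes` and the sample strings are hypothetical.

```python
import unicodedata


def gethash_bytes(data):
    """Read the NFC-normalized UTF-8 bytes of data as one big-endian integer."""
    encoded = unicodedata.normalize("NFC", data).encode("utf-8")
    return int.from_bytes(encoded, "big")


h1 = gethash_bytes("hello1")
h2 = gethash_bytes("hello2")

# '1' is 0x31 and '2' is 0x32, so the two hashes differ in exactly two bits.
print(bin(h1 ^ h2).count("1"))  # → 2

# 'A' followed by a combining ring normalizes to the single character U+00C5.
print(gethash_bytes("A\u030a") == gethash_bytes("\xc5"))  # → True

# Reducing modulo 2**32 keeps only the last four bytes of the encoding.
x = gethash_bytes("prefix-A-tail") % 2**32
y = gethash_bytes("prefix-B-tail") % 2**32
print(x == y)  # → True
```

The last line shows a cost of the modulo step: strings that differ only before their final four bytes collide, so the "different strings, different hashes" requirement holds only for the unreduced value.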
