对字符串进行排序，以使相邻字符串之间的汉明距离较小

发布于 2024-12-23 02:34:30 字数 1164 浏览 1 评论 0原文

问题：

我有 N (~100k-1m) 个字符串，每个字符串长度为 D（例如 2000）个字符，并且字母表较小（例如 3 个可能的字符）。我想对这些字符串进行排序，以使相邻字符串之间的可能变化尽可能少（例如汉明距离较低）。解决方案不一定是最好的，但越接近越好。

示例

N=4
D=5
//initial strings
1. aaacb
2. bacba
3. acacb
4. cbcba

//sorted so that hamming distance between adjacent strings is low
1. aaacb
3. acacb (Hamming distance 1->3 = 1)
4. cbcba (Hamming distance 3->4 = 4)
2. bacba (Hamming distance 4->2 = 2)

对问题的思考

我有一种不好的预感，这是一个不平凡的问题。如果我们将每个字符串视为一个节点，并将到其他字符串的距离视为一条边，那么我们正在考虑旅行商问题。大量的字符串意味着预先计算所有成对距离可能是不可行的，我认为将问题变成更像加拿大旅行者问题。

目前我的解决方案是使用 VP 树来查找贪婪的最近邻居类型问题的解决方案

curr_string = a randomly chosen string from full set
while(tree not empty)
    found_string = find nearest string in tree
    tree.remove(found_string)
    sorted_list.add(curr_string)
    curr_string = found_string

，但初步结果似乎很差。对字符串进行哈希处理以使更多相似的字符串更接近可能是另一种选择，但我对这将提供的解决方案有多好以及它将如何扩展到这种大小的数据知之甚少。

原文

Problem:

I have N (~100k-1m) strings each D (e.g. 2000) characters long and with a low alphabet (eg 3 possible characters). I would like to sort these strings such that there are as few possible changes between adjacent strings (eg hamming distance is low). Solution doesn't have to be the best possible but closer the better.

Example

N=4
D=5
//initial strings
1. aaacb
2. bacba
3. acacb
4. cbcba

//sorted so that hamming distance between adjacent strings is low
1. aaacb
3. acacb (Hamming distance 1->3 = 1)
4. cbcba (Hamming distance 3->4 = 4)
2. bacba (Hamming distance 4->2 = 2)

Thoughts about the problem

I have a bad feeling this is a non trivial problem. If we think of each string as a node and the distances to other strings as an edge, then we are looking at a travelling salesman problem. The large number of strings means that calculating all of the pairwise distances beforehand is potentially infeasible, I think turning the problem into some more like the Canadian Traveller Problem.

At the moment my solution has been to use a VP tree to find a greedy nearest neighbour type solution to the problem

curr_string = a randomly chosen string from full set
while(tree not empty)
    found_string = find nearest string in tree
    tree.remove(found_string)
    sorted_list.add(curr_string)
    curr_string = found_string

but initial results appear to be poor. Hashing strings so that more similar ones are closer may be another option but I know little about how good a solution this will provide or how well it will scale to data of this size.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

羁拥 2024-12-30 02:34:30

即使您认为这个问题类似于旅行商问题（TSP），我相信汉明距离将遵循三角不等式（汉明（A，B）+汉明（B，C）≤汉明（A，C）），因此，您实际上只是在处理 ΔTSP（公制旅行商问题），对此有许多算法可以在理想结果下给出良好的近似值。特别是，Christofides 算法始终会为您提供最多 1.5 倍最小可能长度的路径。

回复收藏 0 原文