Sorting strings so that the Hamming distance between adjacent strings is low
Problem:
I have N (~100k to 1M) strings, each D (e.g. 2000) characters long, over a small alphabet (e.g. 3 possible characters). I would like to sort these strings so that there are as few changes as possible between adjacent strings (e.g. the Hamming distance is low). The solution doesn't have to be the best possible, but the closer the better.
Example
N=4
D=5
//initial strings
1. aaacb
2. bacba
3. acacb
4. cbcba
//sorted so that hamming distance between adjacent strings is low
1. aaacb
3. acacb (Hamming distance 1->3 = 1)
4. cbcba (Hamming distance 3->4 = 5)
2. bacba (Hamming distance 4->2 = 2)
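The distances in an ordering like the one above can be checked mechanically. A minimal sketch in Python (the helper names `hamming` and `ordering_cost` are made up for illustration, they are not from the question):

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def ordering_cost(strings):
    """Sum of Hamming distances between consecutive strings in a list."""
    return sum(hamming(s, t) for s, t in zip(strings, strings[1:]))


# The four example strings, keyed by their original numbering.
strings = {1: "aaacb", 2: "bacba", 3: "acacb", 4: "cbcba"}
order = [1, 3, 4, 2]

# Distance of each step in the ordering, then the total.
steps = [hamming(strings[i], strings[j]) for i, j in zip(order, order[1:])]
print(steps, ordering_cost([strings[i] for i in order]))
```

The total returned by `ordering_cost` is the quantity the sort is trying to keep small.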
Thoughts about the problem
I have a bad feeling this is a non-trivial problem. If we think of each string as a node and the distances to other strings as edges, then we are looking at a travelling salesman problem. The large number of strings means that calculating all of the pairwise distances beforehand is potentially infeasible, which I think turns the problem into something more like the Canadian Traveller Problem.
At the moment my solution is to use a VP tree to find a greedy nearest-neighbour solution to the problem:
curr_string = a randomly chosen string from full set
tree.remove(curr_string)           // so the search cannot return curr_string itself
sorted_list.add(curr_string)
while(tree not empty)
    found_string = find string in tree nearest to curr_string
    tree.remove(found_string)
    sorted_list.add(found_string)  // every string, including the last, gets added
    curr_string = found_string
Initial results appear to be poor, however. Hashing strings so that more similar ones are closer may be another option, but I know little about how good a solution this will provide or how well it will scale to data of this size.
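For reference, the greedy loop can be written out as a small runnable sketch. This is a brute-force stand-in, not the VP-tree version: the `min()` scan plays the role of the tree's nearest-neighbour query, and `greedy_order` is a hypothetical name. It is only practical for small N, since it does O(N^2) distance computations.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_order(strings, start=0):
    """Return indices of `strings` in greedy nearest-neighbour order.

    Assumption: a linear scan replaces the VP-tree lookup; an index
    structure would only change how `nxt` is found.
    """
    remaining = set(range(len(strings)))
    remaining.remove(start)
    order = [start]
    while remaining:
        curr = strings[order[-1]]
        # Nearest unvisited string; ties are broken by index for determinism.
        nxt = min(remaining, key=lambda i: (hamming(curr, strings[i]), i))
        remaining.remove(nxt)
        order.append(nxt)
    return order


example = ["aaacb", "bacba", "acacb", "cbcba"]
print(greedy_order(example))
```

Because each step only looks one move ahead, the last few links of a greedy chain tend to be long, which may be part of why the results look poor.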
Comments (2)
Although this problem resembles the travelling salesman problem (TSP), Hamming distances satisfy the triangle inequality (Hamming(A,B) + Hamming(B,C) ≥ Hamming(A,C)), so you're only really dealing with ∆TSP (the metric travelling salesman problem), for which there are a number of algorithms that give good approximations to the optimal result. In particular, the Christofides algorithm will always give you a tour of at most 1.5x the minimum possible length.
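To illustrate what the metric property buys, here is a sketch of the simpler "double-tree" heuristic rather than Christofides itself: build a minimum spanning tree and visit the nodes in preorder. With a metric distance the resulting path is at most twice the MST weight, and the MST weight is a lower bound on any path through all points, so the path is at most 2x optimal. The function name `mst_preorder` is made up, and the O(N^2) distance computations make this version suitable for small inputs only.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mst_preorder(strings):
    """Visit order from a preorder walk of a minimum spanning tree (Prim)."""
    n = len(strings)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n      # cheapest known edge from the tree to each node
    parent = [-1] * n
    children = [[] for _ in range(n)]
    best[0] = 0
    for _ in range(n):
        # Add the node that is cheapest to attach to the current tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            children[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v]:
                d = hamming(strings[u], strings[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    # Iterative preorder traversal starting from the root (node 0).
    order, stack = [], [0]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(reversed(children[u]))
    return order


example = ["aaacb", "bacba", "acacb", "cbcba"]
print(mst_preorder(example))
```

Christofides improves the bound to 1.5x by adding a minimum-weight perfect matching on the odd-degree MST nodes, at a noticeably higher computational cost.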
Yes, this is a travelling salesman problem, but I don't know if any of the dozen programs under TSP source code library can do 1M points straight up, with a plug-in metric.
A possible 2-stage approach:
1) split the 1M points into 50 clusters with a nearest neighbor search. Do TSP on the 50 cluster centres.
2) put all the 1M - 50 points between the 2 nearest centres; do TSP on each chain of about 1M / 50 points.
Here "50" could be 100 or 1000.
If 1000 is too big, recurse: split 1000 into 30 clusters of ~ 30 each.
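A rough sketch of the two-stage idea, under several simplifying assumptions that are not part of the answer: centres are chosen at random, each point is assigned to its single nearest centre (rather than placed between its 2 nearest centres), and a greedy nearest-neighbour chain stands in for a real TSP solver at both stages. The names `greedy_chain` and `two_stage_order` are hypothetical.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_chain(items, dist, start):
    """Nearest-neighbour chain over `items`, beginning at `start`."""
    remaining = [i for i in items if i != start]
    chain = [start]
    while remaining:
        nxt = min(remaining, key=lambda i: dist(chain[-1], i))
        remaining.remove(nxt)
        chain.append(nxt)
    return chain


def two_stage_order(strings, k, seed=0):
    """Order indices of `strings` by clustering first, then ordering clusters."""
    rng = random.Random(seed)
    d = lambda i, j: hamming(strings[i], strings[j])
    centres = rng.sample(range(len(strings)), k)   # assumption: random centres
    # Stage 1: assign every point to its nearest centre, then order the centres.
    clusters = {c: [] for c in centres}
    for i in range(len(strings)):
        clusters[min(centres, key=lambda c: d(i, c))].append(i)
    centre_order = greedy_chain(centres, d, centres[0])
    # Stage 2: order each cluster on its own and concatenate the chains.
    order = []
    for c in centre_order:
        members = clusters[c]
        if not members:                            # possible with duplicate strings
            continue
        # Enter each cluster at the member closest to where the last one ended.
        start = members[0] if not order else min(members, key=lambda i: d(order[-1], i))
        order.extend(greedy_chain(members, d, start))
    return order


rng = random.Random(1)
data = ["".join(rng.choice("abc") for _ in range(20)) for _ in range(60)]
result = two_stage_order(data, k=5)
```

Each point is compared against k centres and then only against members of its own cluster, which is what keeps the work well below N^2 when the clusters are balanced.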
K-means can cluster 1M points, but again I don't know of a fast implementation with a plug-in metric. See however scikit-learn clustering.
To find a centroid of N points, one which minimizes |centre - all others|, you can afaik beat O(N^2) only by taking the best of a random sample of, say, sqrt(N) points, which should be good enough. (Or google / ask a separate question on fast approximate centroid.)
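One reading of the sampling idea, as a sketch: draw about sqrt(N) candidate points at random, score each by its total distance to all N points, and keep the best. That costs roughly N^1.5 distance computations instead of N^2. The function name `approx_medoid` is hypothetical, and scoring candidates against the full set (rather than against another sample) is an assumption.

```python
import math
import random


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def approx_medoid(strings, seed=0):
    """Index of the sampled string with the smallest total distance to all others."""
    rng = random.Random(seed)
    n = len(strings)
    m = max(1, math.isqrt(n))               # about sqrt(N) candidates
    candidates = rng.sample(range(n), m)

    def total(i):
        # Sum of distances from candidate i to every string in the set.
        return sum(hamming(strings[i], s) for s in strings)

    return min(candidates, key=total)


points = ["aaaa", "aaab", "aaba", "abaa", "bbbb", "aaaa", "aaaa", "baaa", "aaaa"]
centre = approx_medoid(points)
```

The result is always one of the input strings, so it can be used directly as a cluster centre without defining an "average" string.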
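First pack the data tightly to save memory accesses in the whole flow. In this case, encode a b c as 00 01 10: 2000 x 2 bits = 500 bytes. (Note that 00 is 1 bit away from both 01 and 10, but 01 and 10 differ in 2 bits, so a plain XOR and bit count over this encoding is not exactly the character-level Hamming distance.)
Fwiw, finding min Hammingdist( 4k bits, 10k x 4k ) takes ~ 40 msec on my mac ppc.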
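A sketch of the packing idea using Python integers as bit vectors. Since a raw bit count would count the pair 01/10 twice, this version folds each 2-bit field onto its low bit before counting, so every differing character contributes exactly 1. The names `pack`, `low_bit_mask` and `packed_hamming` are made up for illustration; a production version would more likely use fixed-width machine words and a hardware popcount.

```python
CODE = {"a": 0b00, "b": 0b01, "c": 0b10}


def pack(s):
    """Pack a string over {a, b, c} into one int, 2 bits per character."""
    x = 0
    for ch in s:
        x = (x << 2) | CODE[ch]
    return x


def low_bit_mask(d):
    """Mask with the low bit of each of d 2-bit fields set: 0b0101...01."""
    return int("01" * d, 2) if d else 0


def packed_hamming(x, y, mask):
    """Character-level Hamming distance between two packed strings."""
    diff = x ^ y
    # OR the high bit of each field into its low bit, then keep low bits only,
    # so a field that differs in one or both bits is counted once.
    folded = (diff | (diff >> 1)) & mask
    return bin(folded).count("1")


mask = low_bit_mask(5)
print(packed_hamming(pack("aaacb"), pack("acacb"), mask))
```

The mask only needs to be built once per string length D and can be reused for every comparison.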