字符串重复子序列和压缩
我想做某种“搜索和替换”算法,如果可能的话,它将以有效的方式识别字符串中多次出现的子字符串,并用标记替换该子字符串的所有出现。
例如,给定一个字符串“AbcAdAefgAbijkAblmnAbAb”,请注意“A”重复出现,因此将第一个传递减少为“#1bc#1d#1efg#1bijk#1blmn#1b#1b”,其中#_是索引模式(我们注意到索引表中的模式),然后注意“#1b”重复出现,因此减少为“#2c#1d#1efg#2ijk#2lmn#2#2”。字符串中不再出现任何模式,因此我们完成了。
我找到了一些有关“最长公共子序列”和压缩算法的信息,但似乎没有任何作用。它们要么用于比较两个字符串,要么用于获得某种存储最佳结果。
另一方面,我的目标是将基因组简化为“单词”而不是“字母”。即,我想看到 2c1c2c,而不是 gatcatcgatc。之后我可以做一些正则表达式来查找诸如“#42*#42”之类的东西;如果能在 DNA 中看到重复出现的括号,那就太酷了。
如果我能在网上找到这个,我会跳过自己做,但我之前看不到这个问题的答案,我可以发现。对于任何能给我指出正确方向的人,非常感谢。
I'd like to do some kind of "search and replace" algorithm which will, in an efficient manner if possible, identify a substring of a string which occurs more than once and replace all occurrences of that substring with a token.
For example, given a string "AbcAdAefgAbijkAblmnAbAb", notice that "A" recurs, so reduce in pass one to "#1bc#1d#1efg#1bijk#1blmn#1b#1b" where #_ is an indexed pattern (we note the patterns in an indexed table), then notice that "#1b" recurs so reduce to "#2c#1d#1efg#2ijk#2lmn#2#2". No more patterns occur in the string so we're done.
I have found some information on "longest common subsequences" and compression algorithms, but nothing that seems to do this. They either are for comparing two string or for getting some kind of storage-optimal result.
My objective, on the other hand, is to reduce the genome to its "words" instead of "letters". ie, instead of gatcatcgatc I want to see 2c1c2c. I could do some regex afterwards to find things like "#42*#42"; it would be cool to see recurring brackets in dna.
If I could just find that online I would skip doing it myself but I can't see this question answered before in terms I could uncover. To anyone who can point me in the right direction many thanks.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
字节对编码的作用与您想要的非常接近。
而不是直接搜索最长的重复字符串(自上而下),
每一次字节对编码都会搜索重复的字节对(自下而上)。
但最终它发现了最长的重复字符串(*)。
正如你所看到的,它找到了最长的重复字符串“gatc”。
(*) 字节对编码最终找到最长的重复字符串,
否则它会在进行 (2^8 - uniquechars(source) ) 替换后提前停止。
我怀疑可以调整字节对编码,以便提前停止条件稍微放松——也许是 (2^9 - uniquechars(source) ) 或 2^12 或 2^16。
即使这会损害压缩性能,也许它会给像您这样的应用程序带来有趣的结果。
The byte pair encoding does something pretty close to what you want.
Rather than searching directly for the longest repeated string (top-down),
each pass of byte pair encoding searches for repeated byte pairs (bottom-up).
But eventually it discovers the longest repeated string(*).
As you can see, it has found the longest repeated string "gatc".
(*) byte pair encoding either eventually finds the longest repeated string,
or else it stops early after making (2^8 - uniquechars(source) ) substitutions.
I suspect it may be possible to tweak byte pair encoding so that the early-stop condition is relaxed a little -- perhaps (2^9 - uniquechars(source) ) or 2^12 or 2^16.
Even if that hurts compression performance, perhaps it will give interesting results for applications like yours.