如何检测字符串列表中的重复项?
我有一系列 SQL 调用,我想用它来检测循环(因此不必要的重复 sql 调用),但这让我想到了这个更普遍的问题。
给定一个列表,说 [a,b,c,b,c,a,b,c,b,c,a,b,b]
有什么办法可以把它变成 a,[[b,c]*2,a]*2,b*2
或,[a,[b,c]*2]*2,a,b*2
也就是说,检测重复(可能是嵌套的)。
I have a sequence of SQL calls, which I want to use to detect loops (and hence unnecessary duplicate sql calls), but it got me thinking of this more general problem.
Given a list, say[a,b,c,b,c,a,b,c,b,c,a,b,b]
Is there some way I can turn that intoa,[[b,c]*2,a]*2,b*2
or, [a,[b,c]*2]*2,a,b*2
That is, detect repetitions (possibly nested ones).
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
查看Lempel-Ziv-Welsh 压缩算法。 它建立在检测字符串中的重复并利用它们进行压缩的基础上。 我相信你可以使用 Trie
为了它。
Look into the Lempel-Ziv-Welsh compression algorithm. It is built on detecting repetitions in strings and utilizing them for compression. I believe you can use a Trie
for it.
我不是该领域的专家,但您可能想检查一些压缩算法,在我看来,这正是他们所做的。
I’m no expert in that field but you might want to check out some compression algorithms, it seems to me that this is quite exactly what they do.
如果您可以先对其进行排序,那么很容易再进行一次查找重复的运行。 当然,对像 SQL 查询这样自由格式的东西进行排序听起来有点可怕。
If you can sort it first, then it's easy to go through one more time to find duplicate runs. Of course, sorting something as free-form as SQL queries sounds a bit scary.
如果字符串足够大,一个有趣的方法是对其运行压缩工具(如 gzip、bzip 或 7zip)。 这些工具的工作原理是定位重复项(在各个级别),并用指向文本(或字典)第一个实例的指针替换它们。 您实现的压缩是重复次数的衡量标准。 转储文件(您必须编写代码来执行此操作)将为您提供重复的内容。
If the string is sufficiently large, an interesting approach is to run a compression tool (like gzip, bzip, or 7zip) on it. These tools work by locating repetitions (at various levels), and substituting them by pointers to the first instance of the text (or a dictionary). The compression you achieve is a measure of the repetition. Dumping the file (you'll have to write code to do that) will give you the repeated content.