确定一个序列是否在另一个序列中的最佳方法?
这是“字符串包含子字符串”问题到(更多)任意类型的概括。
给定一个序列(例如列表或元组),确定另一个序列是否在其中的最佳方法是什么? 作为奖励,它应该返回子序列开始的元素的索引:
示例用法(序列中的序列):
>>> seq_in_seq([5,6], [4,'a',3,5,6])
3
>>> seq_in_seq([5,7], [4,'a',3,5,6])
-1 # or None, or whatever
到目前为止,我只是依靠蛮力,它看起来很慢、丑陋且笨拙。
This is a generalization of the "string contains substring" problem to (more) arbitrary types.
Given an sequence (such as a list or tuple), what's the best way of determining whether another sequence is inside it? As a bonus, it should return the index of the element where the subsequence starts:
Example usage (Sequence in Sequence):
>>> seq_in_seq([5,6], [4,'a',3,5,6])
3
>>> seq_in_seq([5,7], [4,'a',3,5,6])
-1 # or None, or whatever
So far, I just rely on brute force and it seems slow, ugly, and clumsy.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(10)
我支持 Knuth-Morris-Pratt 算法。 顺便说一句,您的问题(以及 KMP 解决方案)正是 Python Cookbook 第二版。 您可以在 http://code.activestate.com/recipes/117214/
它找到给定序列中的所有正确的子序列,并且应该用作迭代器:
I second the Knuth-Morris-Pratt algorithm. By the way, your problem (and the KMP solution) is exactly recipe 5.13 in Python Cookbook 2nd edition. You can find the related code at http://code.activestate.com/recipes/117214/
It finds all the correct subsequences in a given sequence, and should be used as an iterator:
这是一种暴力方法
O(n*m)
(类似于@mcella 的回答)。 它可能比纯 PythonO(n+m)
中的 Knuth-Morris-Pratt 算法实现更快(请参阅 @Gregg Lind 答案)。我想知道这种情况下的小有多大?
Here's a brute-force approach
O(n*m)
(similar to @mcella's answer). It might be faster than the Knuth-Morris-Pratt algorithm implementation in pure PythonO(n+m)
(see @Gregg Lind answer) for small input sequences.I wonder how large is the small in this case?
一个简单的方法:转换为字符串并依赖字符串匹配。
使用字符串列表的示例:
使用字符串元组的示例:
使用数字列表的示例:
A simple approach: Convert to strings and rely on string matching.
Example using lists of strings:
Example using tuples of strings:
Example using lists of numbers:
与字符串匹配相同,先生...Knuth-Morris-Pratt 字符串匹配
Same thing as string matching sir...Knuth-Morris-Pratt string matching
抱歉,我不是算法专家,这只是我目前能想到的最快的事情,至少我认为它看起来不错(对我来说),而且我编写它很有趣。 ;-)
最有可能的是你的暴力方法正在做同样的事情。
Sorry I'm not an algorithm expert, it's just the fastest thing my mind can think about at the moment, at least I think it looks nice (to me) and I had fun coding it. ;-)
Most probably it's the same thing your brute force approach is doing.
对于小图案来说,蛮力可能没问题。
对于较大的算法,请查看 Aho-Corasick 算法。
Brute force may be fine for small patterns.
For larger ones, look at the Aho-Corasick algorithm.
这是另一个 KMP 实现:
Here is another KMP implementation:
我参加聚会有点晚了,但这里有一些使用字符串的简单操作:
正如 Ilya V. Schurov 所指出的,在这种情况下,find 方法将不会返回具有多字符字符串或多位数字的正确索引。
I'm a bit late to the party, but here's something simple using strings:
As noted by Ilya V. Schurov, the find method in this case will not return the correct indices with multi-character strings or multi-digit numbers.
为了它的价值,我尝试使用像这样的双端队列:
双端队列实现的一个优点是它只在干草堆上进行一次线性传递。 因此,如果 haystack 正在流式传输,那么它仍然可以工作(与依赖切片的解决方案不同)。
解决方案仍然是暴力破解,O(n*m)。 一些简单的本地基准测试表明,它比
str.index
中字符串搜索的 C 实现慢约 100 倍。For what it's worth, I tried using a deque like so:
One advantage of the deque implementation is that it makes only a single linear pass over the haystack. So if the haystack is streaming then it will still work (unlike the solutions that rely on slicing).
The solution is still brute-force, O(n*m). Some simple local benchmarking showed it was ~100x slower than the C-implementation of string searching in
str.index
.另一种方法,使用集合:
Another approach, using sets: