在列表中查找相同的模式
假设我们有以下列表:
List<int> Journey1 = new List<int>() { 1, 2, 3, 4, 5 };
List<int> Journey2 = new List<int>() { 2, 3, 4, 6, 7, 3, 4 };
List<int> Journey3 = new List<int>() { 6, 7, 1 };
List<int> Journey4 = new List<int>() { 3, 1, 4 };
模式为:
2, 3, 4 -> Journey1, Journey2;
6, 7 -> Journey2, Journey3;
1 -> Journey2, Journey3, Journey4;
5 -> Journey1;
3, 4 -> Journey2;
3 -> Journey4;
4 -> Journey4;
我们有 5000 个列表,每个列表大约有 200 个项目,因此模式可以有 1-200 个项目,并且可以在 1-5000 个列表中看到。
因此我需要非常快速的模式匹配方法。
Assume we have below lists:
List<int> Journey1 = new List<int>() { 1, 2, 3, 4, 5 };
List<int> Journey2 = new List<int>() { 2, 3, 4, 6, 7, 3, 4 };
List<int> Journey3 = new List<int>() { 6, 7, 1 };
List<int> Journey4 = new List<int>() { 3, 1, 4 };
And the patterns are:
2, 3, 4 -> Journey1, Journey2;
6, 7 -> Journey2, Journey3;
1 -> Journey2, Journey3, Journey4;
5 -> Journey1;
3, 4 -> Journey2;
3 -> Journey4;
4 -> Journey4;
We have 5000 lists, and each has around 200 items, so the patterns can have between 1-200 items and can be seen in 1-5000 lists.
Therefore I need very fast way of pattern matching.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
无需预先计算并使用简单的即时搜索:
这将最多执行 2 亿个等于检查您的“数字”。但由于预计不会对整个模式执行检查,因此如果检查所有列表,则可能(只是猜测)约 500 万次等于操作。那是几百毫秒。
这完全取决于什么对您来说“非常快”。如果这太慢,您将需要一种更复杂的方法......
Without precomputation and with a naive on-the-fly search:
This will execute at max 200 million equals checks for your 'numbers'. But since checks are not expected to be executed for whole patterns, that could be (just a guess) ~5 million equals operations if checking all the lists. That's a few hundred milliseconds.
It all depends on what is 'very fast' for you. If that's too slow, you will need a much much more complicated approach ...
我不确定你想要什么作为输出。我刚刚尝试了一下。
我建议您创建一个列表列表,而不是声明单独的列表变量。
我假设数字范围从 0 到 255。通过此查询
和此测试循环,
您将得到此结果
,其中列表编号从零开始。
I am not sure what you want as output. I just made a Try.
I suggest that you make a list of lists, instead of declaring individual list variables.
I assumed that the numbers range from 0 to 255. With this query
and this test loop
you will get this result
Where the lists are numbered starting at zero.
对于暴力方法,您可以尝试使用多项式哈希函数来加速子部分匹配。仍然需要疯狂的比较次数,但至少匹配可以几乎恒定,而不管子序列长度如何。
For brute force approach you can try to use polynomial hash-functions to speed up sub-section matches. Still insane number of comparisons required, but at least match could be almost constant irrespective of sub-sequence length.
在您的情况下,有机会从模式预处理和文本预处理中受益(http://en.wikipedia.org/wiki/String_searching_algorithm)。
例如,为列表中的所有子序列构建一个字典树将允许在与模式长度成比例的时间内查询该列表以查找给定模式。
In your case there are opportunities to benefit from pattern preprocessing as well as text preprocessing (http://en.wikipedia.org/wiki/String_searching_algorithm).
For instance, constructing a trie for all subsequences in a list will allow to query this list for a given pattern in time proportional to the pattern length.