使用大而密集的 2D 矩阵快速计数 2D 子矩阵?
在更大、更密集的矩阵中计算子矩阵的好算法是什么?如果我有一行数据,我可以使用后缀树,但我不确定将后缀树推广到更高的维度是否是完全简单的或者是这里的最佳方法。
想法?
我对密集矩阵的第一个元素进行索引并消除全矩阵搜索的简单解决方案仅比全矩阵扫描提供了适度的改进。
解决这个问题的最佳方法是什么?
Example:
Input:
Full matrix:
123
212
421
Search matrix:
12
21
Output:
2
该子矩阵在完整矩阵中出现两次,因此输出为 2。完整矩阵可能是 1000x1000,但是,搜索矩阵大到 100x100(可变大小),我需要连续处理多个搜索矩阵。因此,这个问题的暴力破解效率太低,无法满足我对多个矩阵的亚秒级搜索时间。
What's a good algorithm for counting submatrices within a larger, dense matrix? If I had a single line of data, I could use a suffix tree, but I'm not sure if generalizing a suffix tree into higher dimensions is exactly straightforward or the best approach here.
Thoughts?
My naive solution to index the first element of the dense matrix and eliminate full-matrix searching provided only a modest improvement over full-matrix scanning.
What's the best way to solve this problem?
Example:
Input:
Full matrix:
123
212
421
Search matrix:
12
21
Output:
2
This sub-matrix occurs twice in the full matrix, so the output is 2. The full matrix could be 1000x1000, however, with a search matrix as large as 100x100 (variable size), and I need to process a number of search matrices in a row. Ergo, a brute force of this problem is far too inefficient to meet my sub-second search time for several matrices.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
对于算法课程,我曾经做过一个练习,其中 Rabin-Karp 字符串搜索算法必须稍微扩展才能以您描述的方式搜索匹配的二维子矩阵。
我认为,如果您花时间理解维基百科上描述的算法,那么您就会清楚将其扩展到二维的自然方法。本质上,您只需在矩阵上进行几次遍历,一次沿着一列爬行。有一些小技巧可以使时间复杂度尽可能低,但您可能根本不需要它们。
在 N×N 矩阵中搜索 M×M 矩阵,这种方法应该为您提供 O(N²⋅M) 算法。通过技巧,我相信它可以细化为 O(N²)。
For an algorithms course, I once worked an exercise in which the Rabin-Karp string-search algorithm had to be extended slightly to search for a matching two-dimensional submatrix in the way you describe.
I think if you take the time to understand the algorithm as it is described on Wikipedia, the natural way of extending it to two dimensions will be clear to you. In essence, you just make several passes over the matrix, creeping along one column at a time. There are some little tricks to keep the time complexity as low as possible, but you probably won't even need them.
Searching an N×N matrix for a M×M matrix, this approach should give you an O(N²⋅M) algorithm. With tricks, I believe it can be refined to O(N²).
计算手册的算法和理论建议什么是 N^2 * log(字母大小) 解决方案。给定一个要搜索的子矩阵,首先对其行进行重复数据删除。现在请注意,如果您逐行搜索大矩阵,则最多有一个已删除的行可以出现在任何位置。使用 Aho-Corasick 及时搜索 N^2 * log(字母大小),并在大矩阵中的每个单元格处记下空值或子矩阵匹配行的标识符。现在再次使用 Aho-Corasick 向下搜索该行匹配矩阵的列,并在所有行都位于彼此下方的情况下发出匹配信号。
Algorithms and Theory of Computation Handbook suggests what is an N^2 * log(Alphabet Size) solution. Given a sub-matrix to search for, first of all de-dupe its rows. Now note that if you search the large matrix row by row at most one of the de-duped rows can appear at any position. Use Aho-Corasick to search this in time N^2 * log(Alphabet Size) and write down at each cell in the large matrix either null or an identifier for the matching row of the sub-matrix. Now use Aho-Corasick again to search down the columns of this matrix of row matches and signal a match where all the rows are present below each other.
这听起来类似于模板匹配。如果有动力,您可能可以使用 FFT 转换原始数组,从暴力搜索中删除日志。 (Nlog(M)) 而不是 (NM)
This sounds similar to template matching. If motivated you could probably transform your original array with the FFT and drop a log from a brute force search. (Nlog(M)) instead of (NM)
我没有现成的答案,但我将如何开始:
-- 您想要非常快速的查找,您可以花多少(时间)来构建索引结构?当暴力破解不够快时,您需要索引。
-- 关于您的数据,您还知道哪些没有告诉我们的信息?您的所有矩阵中的所有值都是单位数整数吗?
-- 如果它们是单位数整数(或任何可以表示为单个字符或索引值的其他值),请考虑线性化您的 2D 结构。一种方法是沿着从右上角到左下角的对角线读取矩阵,并从左上角到右下角扫描。很难用语言解释,但请阅读矩阵:
如 125369470c8adbef
(明白吗?)
现在您可以将超级矩阵索引到您的速度和空间要求所需的任何深度;在我的示例中,键 1253... 指向元素 (1,1),键 abef 指向元素 (3,3)。不确定这是否适合您,您必须调整解决方案的参数。选择您最喜欢的存储键值对的方法:哈希、列表,甚至在情况变得疯狂时在索引中构建一些索引。
问候
马克
I don't have a ready answer but here's how I would start:
-- You want very fast lookup, how much (time) can you spend on building index structures? When brute-force isn't fast enough you need indexes.
-- What do you know about your data that you haven't told us? Are all the values in all your matrices single-digit integers?
-- If they are single-digit integers (or anything else you can represent as a single character or index value), think about linearising your 2D structures. One way to do this would be to read the matrix along a diagonal running top-right to bottom-left and scanning from top-left to bottom-right. Difficult to explain in words, but read the matrix:
as 125369470c8adbef
(get it?)
Now you can index your super-matrix to whatever depth your speed and space requirements demand; in my example key 1253... points to element (1,1), key abef points to element (3,3). Not sure if this works for you, and you'll have to play around with the parameters to your solution. Choose your favourite method for storing the key-value pairs: a hash, a list, or even build some indexes into the index if things get wild.
Regards
Mark