字符串分析
给定一系列操作:
a*b*a*b*a*a*b*a*b
有没有办法获得最佳细分以实现子字符串的重用。
制作
a*b*a*b*a*a*b*a*b => c*a*c,其中 c = a*b*a*b
然后看到
a*b*a*b => d*d,其中 d = a*b
总共将 8 个初始操作减少到此处描述的 4 个?
(c = (d = a*b)*d)*a*c
的操作数量
当然,目标是最大限度地减少我正在考虑的后缀树
。我对线性时间启发法或解决方案特别感兴趣。 “*”运算实际上是矩阵乘法。
Given a sequence of operations:
a*b*a*b*a*a*b*a*b
is there a way to get the optimal subdivision to enable reusage of substring.
making
a*b*a*b*a*a*b*a*b => c*a*c, where c = a*b*a*b
and then seeing that
a*b*a*b => d*d, where d = a*b
all in all reducing the 8 initial operations into the 4 described here?
(c = (d = a*b)*d)*a*c
The goal of course is to minimize the number of operations
I'm considering a suffixtree of sorts.
I'm especially interested in linear time heuristics or solutions.
The '*' operations are actually matrix multiplications.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
整个问题被称为“公共子表达式消除”或 CSE。这是实现者面临的名为“Graph Reduction”问题的一个稍微小的版本函数式编程语言的编译器。谷歌搜索“公共子表达式消除算法”给出了很多解决方案,尽管我看不到任何解决方案,特别是对于矩阵乘法给出的约束。
链接的页面提供了大量参考资料。
我的旧答案如下。然而,经过更多研究,解决方案只是构建一个后缀树。这可以在 O(N) 时间内完成(维基百科页面上有很多参考资料)。完成此操作后,子表达式(问题中的 c、d 等)只是后缀树中的节点 - 只需将它们拉出来即可。
然而,我认为 MarcoS 的建议是最长重复子串,作为图缩减优先级可能不允许此处允许的优化。
算法草图:
每次运行最长的重复子串需要时间 N。您可以重新使用 后缀树 你的构建是为了及时解决整个问题 N.
This whole problem is known as "Common Subexpression Elimination" or CSE. It is a slightly smaller version of the problem called "Graph Reduction" faced by the implementer of compilers for functional programming languages. Googling "Common Subexpression elimination algorithm" gives lots of solutions, though none that I can see especially for the constraints given by matrix multiplication.
The pages linked to give a lot of references.
My old answer is below. However, having researched a bit more, the solution is simply building a suffix tree. This can be done in O(N) time (lots of references on the wikipedia page). Having done this, the sub-expressions (c, d etc. in your question) are just nodes in the suffix tree - just pull them out.
However, I think MarcoS is on to something with the suggestion of Longest repeating Substring, as graph reduction precedence might not allow optimisations that can be allowed here.
sketch of algorithm:
Each run of longest repeating substring takes time N. You can probably re-use the suffix tree you build to solve the whole thing in time N.
编辑:除了已接受的答案之外,还需要该答案中的增长顺序才能运行 CSE 或矩阵链乘法
有趣的是,压缩算法可能就是您想要的:压缩算法寻求减少其压缩内容的大小,如果它能做到这一点的唯一方法是替换,您可以跟踪它并获取算法所需的子组件。尽管对于小输入,这可能不会给出好的结果。
哪些操作子集是可交换的将是选择此类算法时的重要考虑因素。 [编辑:OP说在他/她的情况下没有操作是可交换的]
如果我们忽略缓存等影响,我们还可以定义一个最佳解决方案:
存在一个关于贪婪方法是否可行的问题:是否应该压缩重复每一步的子串。情况可能并非如此,例如
,但是我有一种预感,如果我们尝试压缩子字符串的所有顺序,我们可能不会经常遇到这个问题。
因此,写下我们想要的(成本)并考虑可能的问题后,我们已经有了一个可以做到这一点的强力算法,并且它将运行非常少量的矩阵:
注意:根据 Asgeir 的另一个答案,这称为矩阵链乘法优化问题。 Nick Fortescue 指出,这也更普遍地称为 http://en.wikipedia.org/wiki/Common_subexpression_elimination——因此,人们可以从文献中找到任何通用的 CSE 或矩阵链乘法算法/库,并插入我之前提到的成本数量级 (无论您使用哪种解决方案,您都需要这些)。请注意,上述计算(乘法、求幂等)的成本假设它们是通过最先进的算法有效完成的;如果不是这种情况,请将指数替换为与执行操作的方式相对应的适当值。
edit: The orders-of-growth in this answer are needed in addition to the accepted answer in order to run CSE or matrix-chain multiplication
Interestingly, a compression algorithm may be what you want: a compression algorithm seeks to reduce the size of what it's compressing, and if the only way it can do that is substitution, you can trace it and obtain the necessary subcomponents for your algorithm. This may not give nice results though for small inputs.
What subsets of your operations are commutative will be an important consideration in choosing such an algorithm. [edit: OP says no operations are commutative in his/her situation]
We can also define an optimal solution, if we ignore effects such as caching:
There is the question about whether a greedy approach is feasible for not: whether one SHOULD compress repeating substrings at each step. This may not be the case, e.g.
However I have a hunch that, if we try all orders of compressing substrings, we will probably not run into this issue too often.
Thus having written down what we want (the costs) and considered possibly issues, we already have a brute-force algorithm which can do this, and it will run for very small numbers of matrices:
Note: according to another answer by Asgeir, this is known as the Matrix Chain Multiplication optimization problem. Nick Fortescue notes this is also known more generally as http://en.wikipedia.org/wiki/Common_subexpression_elimination -- thus one could find any generic CSE or Matrix-Chain-Multiplication algorithm/library from the literature, and plug in the cost orders-of-magnitude I mentioned earlier (you will need those nomatter which solution you use). Note that the cost of the above calculations (multiplication, exponentiation, etc.) assume that they are being done efficiently with state-of-the-art algorithms; if this is not the case, replace the exponents with appropriate values which correspond to the way the operations will be carried out.
如果您想使用最少的算术运算,那么您应该看看矩阵链乘法,它可以减少到 O(n log n)
If you want to use the fewest arithmetic operations then you should have a look at matrix chain multiplication which can be reduced to O(n log n)
从头顶上看,这个问题对我来说似乎是 NP 。根据您正在进行的替换,其他替换可能或不可能,例如对于字符串
d*e*a*b*c*d*e*a*b*c*d*e*a
有几种可能性。如果取最长的公共字符串,它将是:
f = d*e*a*b*c
并且您可以替换f*f*e*a
最后留下三个乘法和四个中间乘法(总共七)。如果您用以下方式替换:
f = d*e*a
您将得到f*b*c*f*b*c*f
,您可以使用g = f*b 进一步替换它*c
到g*g*f
总共六次乘法。这个问题还有其他可能的替代方案,但我现在没有时间一一列举。
我猜测对于完整的最小替换,不仅需要找出最长的公共子串,还需要找出每个子串重复的次数,这可能意味着您必须跟踪到目前为止的所有替换并进行回溯。但它仍然可能比实际的乘法更快。
From the top of the head the problem seems in NP for me. Depending on the substitutions you are doing other substitions will be possible or impossible for example for the string
d*e*a*b*c*d*e*a*b*c*d*e*a
there are several possibilities.If you take the longest common string it will be:
f = d*e*a*b*c
and you could substitutef*f*e*a
leaving you with three multiplications in the end and four intermediate ones (total seven).If you instead substitute the following way:
f = d*e*a
you getf*b*c*f*b*c*f
which you can further substitute usingg = f*b*c
tog*g*f
for a total of six multiplication.There are other possible substitutions in this problem, but I do not have the time to count them all right now.
I am guessing for a complete minimal substitution it is not only necessary to figure out the longest common substring but also the number of times each substring repeats, which probably means you have to track all substitutions so far and do backtracking. Still it might be faster than the actual multiplications.
这不是最长重复子串问题吗?
Isn't this the Longest repeated substring problem?