当前位置：文江博客话题详情

字符串分析

发布于 2024-11-06 11:55:39 字数 456 浏览 8 评论 0原文

给定一系列操作：

a*b*a*b*a*a*b*a*b

有没有办法获得最佳细分以实现子字符串的重用。

制作

a*b*a*b*a*a*b*a*b => c*a*c，其中 c = a*b*a*b

然后看到

a*b*a*b => d*d，其中 d = a*b

总共将 8 个初始操作减少到此处描述的 4 个？

(c = (d = a*b)*d)*a*c

的操作数量

当然，目标是最大限度地减少我正在考虑的后缀树

。我对线性时间启发法或解决方案特别感兴趣。 “*”运算实际上是矩阵乘法。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

給妳壹絲溫柔 2024-11-13 11:55:39

整个问题被称为“公共子表达式消除”或 CSE。这是实现者面临的名为“Graph Reduction”问题的一个稍微小的版本函数式编程语言的编译器。谷歌搜索“公共子表达式消除算法”给出了很多解决方案，尽管我看不到任何解决方案，特别是对于矩阵乘法给出的约束。

链接的页面提供了大量参考资料。

我的旧答案如下。然而，经过更多研究，解决方案只是构建一个后缀树。这可以在 O(N) 时间内完成（维基百科页面上有很多参考资料）。完成此操作后，子表达式（问题中的 c、d 等）只是后缀树中的节点 - 只需将它们拉出来即可。

然而，我认为 MarcoS 的建议是最长重复子串，作为图缩减优先级可能不允许此处允许的优化。

算法草图：

optimise(s):
    let sub = longestRepeatingSubstring(s).
    optimisedSub = optimise(sub)
    return s with sub replaced by optimisedSub

每次运行最长的重复子串需要时间 N。您可以重新使用后缀树你的构建是为了及时解决整个问题 N.

This whole problem is known as "Common Subexpression Elimination" or CSE. It is a slightly smaller version of the problem called "Graph Reduction" faced by the implementer of compilers for functional programming languages. Googling "Common Subexpression elimination algorithm" gives lots of solutions, though none that I can see especially for the constraints given by matrix multiplication.

The pages linked to give a lot of references.

My old answer is below. However, having researched a bit more, the solution is simply building a suffix tree. This can be done in O(N) time (lots of references on the wikipedia page). Having done this, the sub-expressions (c, d etc. in your question) are just nodes in the suffix tree - just pull them out.

However, I think MarcoS is on to something with the suggestion of Longest repeating Substring, as graph reduction precedence might not allow optimisations that can be allowed here.

sketch of algorithm:

optimise(s):
    let sub = longestRepeatingSubstring(s).
    optimisedSub = optimise(sub)
    return s with sub replaced by optimisedSub

Each run of longest repeating substring takes time N. You can probably re-use the suffix tree you build to solve the whole thing in time N.

回复收藏 0 原文

递刀给你 2024-11-13 11:55:39

编辑：除了已接受的答案之外，还需要该答案中的增长顺序才能运行 CSE 或矩阵链乘法

有趣的是，压缩算法可能就是您想要的：压缩算法寻求减少其压缩内容的大小，如果它能做到这一点的唯一方法是替换，您可以跟踪它并获取算法所需的子组件。尽管对于小输入，这可能不会给出好的结果。

哪些操作子集是可交换的将是选择此类算法时的重要考虑因素。 [编辑：OP说在他/她的情况下没有操作是可交换的]

如果我们忽略缓存等影响，我们还可以定义一个最佳解决方案：

input: [some product of matrices to compute]

given that multiplying two NxN matrices is O(N^2.376)
given we can visualize the product as follows:
    [[AxB][BxC][CxD][DxE]...]
we must for example perform O(max(A,B,C)^2.376) or so operations in order to combine
    [AxB][BxC] -> [AxC]

The max(...) is an estimate based on how fast it is to multiply two square matrices;
a better estimate of cost(A,B,C) for multiplying an AxB * BxC matrix can be gotten 
from actually looking at the algorithm, or running benchmarks if you don't know the
algorithm used.

However note that multiplying the same matrix with itself, i.e. calculating
a power, can be much more efficient, and we also need to take that into account.
At worst, it takes log_2(power) multiplies each of O(N^2.376), but this could be
made more efficient by diagonalizing the matrix first.

存在一个关于贪婪方法是否可行的问题：是否应该压缩重复每一步的子串。情况可能并非如此，例如

aaaaabaab
compressing 'aa' results in ccabcb and compressing 'aab' is now impossible

，但是我有一种预感，如果我们尝试压缩子字符串的所有顺序，我们可能不会经常遇到这个问题。

因此，写下我们想要的（成本）并考虑可能的问题后，我们已经有了一个可以做到这一点的强力算法，并且它将运行非常少量的矩阵：

# pseudocode

def compress(problem, substring)
    x = new Problem(problem)
    x.string.replaceall(substring, newsymbol)
    x.subcomputations += Subcomputation(newsymbol=substring)

def bestCompression(problem)
    candidateCompressions = [compress(problem,substring) for each substring in problem.string]
    # etc., recursively return problem with minimum cost
    # dynamic programming may help make this more efficient, but one must watch
    #   out for the note above, how it may be hard to be greedy

注意：根据 Asgeir 的另一个答案，这称为矩阵链乘法优化问题。 Nick Fortescue 指出，这也更普遍地称为 http://en.wikipedia.org/wiki/Common_subexpression_elimination——因此，人们可以从文献中找到任何通用的 CSE 或矩阵链乘法算法/库，并插入我之前提到的成本数量级 （无论您使用哪种解决方案，您都需要这些）。请注意，上述计算（乘法、求幂等）的成本假设它们是通过最先进的算法有效完成的；如果不是这种情况，请将指数替换为与执行操作的方式相对应的适当值。

edit: The orders-of-growth in this answer are needed in addition to the accepted answer in order to run CSE or matrix-chain multiplication

Interestingly, a compression algorithm may be what you want: a compression algorithm seeks to reduce the size of what it's compressing, and if the only way it can do that is substitution, you can trace it and obtain the necessary subcomponents for your algorithm. This may not give nice results though for small inputs.

What subsets of your operations are commutative will be an important consideration in choosing such an algorithm. [edit: OP says no operations are commutative in his/her situation]

We can also define an optimal solution, if we ignore effects such as caching:

input: [some product of matrices to compute]

given that multiplying two NxN matrices is O(N^2.376)
given we can visualize the product as follows:
    [[AxB][BxC][CxD][DxE]...]
we must for example perform O(max(A,B,C)^2.376) or so operations in order to combine
    [AxB][BxC] -> [AxC]

The max(...) is an estimate based on how fast it is to multiply two square matrices;
a better estimate of cost(A,B,C) for multiplying an AxB * BxC matrix can be gotten 
from actually looking at the algorithm, or running benchmarks if you don't know the
algorithm used.

However note that multiplying the same matrix with itself, i.e. calculating
a power, can be much more efficient, and we also need to take that into account.
At worst, it takes log_2(power) multiplies each of O(N^2.376), but this could be
made more efficient by diagonalizing the matrix first.

There is the question about whether a greedy approach is feasible for not: whether one SHOULD compress repeating substrings at each step. This may not be the case, e.g.

aaaaabaab
compressing 'aa' results in ccabcb and compressing 'aab' is now impossible

However I have a hunch that, if we try all orders of compressing substrings, we will probably not run into this issue too often.

Thus having written down what we want (the costs) and considered possibly issues, we already have a brute-force algorithm which can do this, and it will run for very small numbers of matrices:

# pseudocode

def compress(problem, substring)
    x = new Problem(problem)
    x.string.replaceall(substring, newsymbol)
    x.subcomputations += Subcomputation(newsymbol=substring)

def bestCompression(problem)
    candidateCompressions = [compress(problem,substring) for each substring in problem.string]
    # etc., recursively return problem with minimum cost
    # dynamic programming may help make this more efficient, but one must watch
    #   out for the note above, how it may be hard to be greedy

Note: according to another answer by Asgeir, this is known as the Matrix Chain Multiplication optimization problem. Nick Fortescue notes this is also known more generally as http://en.wikipedia.org/wiki/Common_subexpression_elimination -- thus one could find any generic CSE or Matrix-Chain-Multiplication algorithm/library from the literature, and plug in the cost orders-of-magnitude I mentioned earlier (you will need those nomatter which solution you use). Note that the cost of the above calculations (multiplication, exponentiation, etc.) assume that they are being done efficiently with state-of-the-art algorithms; if this is not the case, replace the exponents with appropriate values which correspond to the way the operations will be carried out.

回复收藏 0 原文