动态解析器/预生成空间/时间权衡注意事项

发布于 2024-11-08 00:37:34 字数 1027 浏览 9 评论 0原文

使用即时解析器与空间相关的好处是否超过预先生成的查找表与时间相关的好处？

长版：

我正在编写一个化学参考工具，并且包含一个功能，可以自动命名符合特定模式的公式；例如 C[n]H[2n+2] => [n]ane；其中 [n] 是 LHS 的整数；以及 RHS 上名称数组的索引。 (meth, eth, ...)

据我所知，这可以通过以下两种方式之一实现：

我预先生成一个键/值对偶公式<=>的查找字典名称 对；无论是在应用程序启动时（启动速度较慢），还是与应用程序一起发布的静态列表（下载速度较慢）。
公式由定制的解析器即时评估。

在方法 1. 中 name =>公式查找变得简单一个数量级；但是，除非我想通过应用程序发送数十兆字节的数据，否则生成器必须有一个预设且相当低的 n 值。

更复杂的是，公式可以有多个项；例如C[n]H[2n+1]OC[n']H[2n'+1]；对于其中的每一个，可能的匹配数量随着n呈几何级数增加。此外，使用这种方法会消耗内存，这与任何人无关。

方法 2. 让我可以使用相当小的查找表来支持相当大的 n 值，但使 name =>公式查找稍微复杂一些。与随应用程序一起交付的预生成文件相比，它还允许我纠正生成逻辑中的错误，而无需交付新的数据文件。

这还要求每个公式都与多个规则的粗略测试相匹配，以确定它是否可以适合；如果有很多规则，则需要时间，可能会导致界面明显变慢。

那么问题是：

在我未能考虑的权衡中是否有任何考虑因素，或者我没有考虑的方法？
使用即时解析器的好处是否证明增加的实现复杂性是合理的？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

猥︴琐丶欲为 2024-11-15 00:37:34

您应该采用第二种方法。

一种可能的解决方案是贪心算法。将一组转换定义为正则表达式（用于测试模式）和一个函数，该函数被赋予正则表达式匹配对象并返回转换后的字符串。

正则表达式不够强大，无法直接处理您想要的内容。相反，您必须执行类似的操作：（

m = re.match(r"C\[(\d+)\]H\[(\d+)]\]", formula)
if m:
    C_count, H_count = int(m.group(1)), int(m.group(2))
    match_size = len(m.group(0))
    if C_count*2+2 == H_count:
        replacement = alkane_lookup[C_count]
    elif C_count*2 == H_count:
        replacement = alkene_lookup[C_count]
    ...
    else:
        replacement = m.group(0)  # no replacement available

加上其他可能性的更多内容），

然后将其嵌入到如下所示的循环中：（

formula = "...."
new_formula = ""
while formula:
    match_size, replacement = find_replacement(formula)
    new_formula += replacement
    formula = formula[match_size:]

您需要处理没有匹配项的情况。一种可能的方法是包含find_replacement() 末尾所有可能元素的列表，仅返回下一个元素并进行计数。）

这是一种贪心算法，不能保证最小解决方案。这更复杂，但由于化学家本身对正确形式有不同的想法，所以我不会太担心。

You should go with the second approach.

One possible solution is a greedy algorithm. Define your set of transformations as a regular expression (used to test the pattern) and a function which is given the regexp match object and returns the transformed string.

Regular expressions aren't quite powerful enough to handle what you want directly. Instead you'll have to do something like:

m = re.match(r"C\[(\d+)\]H\[(\d+)]\]", formula)
if m:
    C_count, H_count = int(m.group(1)), int(m.group(2))
    match_size = len(m.group(0))
    if C_count*2+2 == H_count:
        replacement = alkane_lookup[C_count]
    elif C_count*2 == H_count:
        replacement = alkene_lookup[C_count]
    ...
    else:
        replacement = m.group(0)  # no replacement available

(plus a lot more for the other possibilities)

then embed that in a loop which looks like:

formula = "...."
new_formula = ""
while formula:
    match_size, replacement = find_replacement(formula)
    new_formula += replacement
    formula = formula[match_size:]

(You'll need to handle the case where nothing matches. One possible way is to include a list of all possible elements at the end of find_replacement(), which only returns the next element and counts.)

This is a greedy algorithm, which doesn't guarantee the smallest solution. That's more complicated, but since chemists themselves have different ideas of the right form, I wouldn't worry so much about it.

回复收藏 0 原文

~没有更多了~