动态解析器​​/预生成空间/时间权衡注意事项

发布于 2024-11-08 00:37:34 字数 1027 浏览 5 评论 0原文

使用即时解析器与空间相关的好处是否超过预先生成的查找表与时间相关的好处?


长版

我正在编写一个化学参考工具,并且包含一个功能,可以自动命名符合特定模式的公式;例如 C[n]H[2n+2] => [n]ane;其中 [n] 是 LHS 的整数;以及 RHS 上名称数组的索引。 (meth, eth, ...)

据我所知,这可以通过以下两种方式之一实现:

  1. 我预先生成一个键/值对偶公式<=>的查找字典名称 对;无论是在应用程序启动时(启动速度较慢),还是与应用程序一起发布的静态列表(下载速度较慢)。

  2. 公式由定制的解析器即时评估。

方法 1. 中 name =>公式查找变得简单一个数量级;但是,除非我想通过应用程序发送数十兆字节的数据,否则生成器必须有一个预设且相当低的 n 值。

更复杂的是,公式可以有多个项;例如C[n]H[2n+1]OC[n']H[2n'+1];对于其中的每一个,可能的匹配数量随着n呈几何级数增加。此外,使用这种方法会消耗内存,这与任何人无关。

方法 2. 让我可以使用相当小的查找表来支持相当大的 n 值,但使 name =>公式查找稍微复杂一些。与随应用程序一起交付的预生成文件相比,它还允许我纠正生成逻辑中的错误,而无需交付新的数据文件。

这还要求每个公式都与多个规则的粗略测试相匹配,以确定它是否可以适合;如果有很多规则,则需要时间,可能会导致界面明显变慢。

那么问题是:

  1. 在我未能考虑的权衡中是否有任何考虑因素,或者我没有考虑的方法?

  2. 使用即时解析器的好处是否证明增加的实现复杂性是合理的?

Do the space-related benefits of using an on-the-fly parser outweigh the time-related benefits of a pre-generated lookup table?


Long version:

I am authoring a chemistry reference tool, and am including a feature that will automatically name formulae conforming to a specific pattern; e.g. C[n]H[2n+2] => [n]ane; where [n] is an integer for the LHS; and an index into an array of names on the RHS. (meth, eth, …)

As far as I can see, this can be implemented in one of two ways:

  1. I pre-generate a key/value dual lookup dictionary of formula <=> name pairs; either when the application starts (slower startup), or a static list which is published with the application (slower download).

  2. Formulae are evaluated on the fly by a custom-built parser.

In approach 1. name => formula lookup becomes simpler by an order of magnitude; but the generator will, unless I want to ship dozens of megabytes of data with the application, have to have a preset, and fairly low, value for n.

Compounding this is the fact that formulae can have several terms; such as C[n]H[2n+1]OC[n']H[2n'+1]; and for each of these, the number of possible matches increases geometrically with n. Additionally, using this approach would eat RAM like nobody's business.

Approach 2. lets me support fairly large values of n using a fairly small lookup table, but makes name => formula lookup somewhat more complex. Compared to the pre-generation to file for shipping with the application, it also lets me correct errors in the generation logic without having to ship new data files.

This also requires that each formula be matched against a cursory test for several rules, determining if it could fit; which, if there are a lot of rules, takes time that might lead to noticeable slowdowns in the interface.

The question then, is:

  1. Are there any considerations in the tradeoff I have failed to account for, or approaches that I haven't considered?

  2. Do the benefits of using an on-the-fly parser justify the increased implementation complexity?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

猥︴琐丶欲为 2024-11-15 00:37:34

您应该采用第二种方法。

一种可能的解决方案是贪心算法。将一组转换定义为正则表达式(用于测试模式)和一个函数,该函数被赋予正则表达式匹配对象并返回转换后的字符串。

正则表达式不够强大,无法直接处理您想要的内容。相反,您必须执行类似的操作:(

m = re.match(r"C\[(\d+)\]H\[(\d+)]\]", formula)
if m:
    C_count, H_count = int(m.group(1)), int(m.group(2))
    match_size = len(m.group(0))
    if C_count*2+2 == H_count:
        replacement = alkane_lookup[C_count]
    elif C_count*2 == H_count:
        replacement = alkene_lookup[C_count]
    ...
    else:
        replacement = m.group(0)  # no replacement available

加上其他可能性的更多内容),

然后将其嵌入到如下所示的循环中:(

formula = "...."
new_formula = ""
while formula:
    match_size, replacement = find_replacement(formula)
    new_formula += replacement
    formula = formula[match_size:]

您需要处理没有匹配项的情况。一种可能的方法是包含find_replacement() 末尾所有可能元素的列表,仅返回下一个元素并进行计数。)

这是一种贪心算法,不能保证最小解决方案。这更复杂,但由于化学家本身对正确形式有不同的想法,所以我不会太担心。

You should go with the second approach.

One possible solution is a greedy algorithm. Define your set of transformations as a regular expression (used to test the pattern) and a function which is given the regexp match object and returns the transformed string.

Regular expressions aren't quite powerful enough to handle what you want directly. Instead you'll have to do something like:

m = re.match(r"C\[(\d+)\]H\[(\d+)]\]", formula)
if m:
    C_count, H_count = int(m.group(1)), int(m.group(2))
    match_size = len(m.group(0))
    if C_count*2+2 == H_count:
        replacement = alkane_lookup[C_count]
    elif C_count*2 == H_count:
        replacement = alkene_lookup[C_count]
    ...
    else:
        replacement = m.group(0)  # no replacement available

(plus a lot more for the other possibilities)

then embed that in a loop which looks like:

formula = "...."
new_formula = ""
while formula:
    match_size, replacement = find_replacement(formula)
    new_formula += replacement
    formula = formula[match_size:]

(You'll need to handle the case where nothing matches. One possible way is to include a list of all possible elements at the end of find_replacement(), which only returns the next element and counts.)

This is a greedy algorithm, which doesn't guarantee the smallest solution. That's more complicated, but since chemists themselves have different ideas of the right form, I wouldn't worry so much about it.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文