Parsing chemical formulas
I'm trying to write a method for an app that takes a chemical formula like "CH3COOH" and returns some sort of collection containing its element symbols.
CH3COOH would return [C,H,H,H,C,O,O,H]
I already have something that is kinda working, but it's very complicated and uses a lot of code with a lot of nested if-else structures and loops.
Is there a way I can do this by using some kind of regular expression with String.split or maybe in some other brilliant simple code?
Comments (5)
I have developed a couple of series of articles on how to parse molecular formulas, including more complex formulas like C6H2(NO2)3CH3.
The most recent is my presentation "PLY and PyParsing" at PyCon2010 where I compare those two Python parsing systems using a molecular formula evaluator as my sample problem. There's even a video of my presentation.
The presentation was based on a three-part series of articles I did developing a molecular formula parser using ANTLR. In part 3 I compare the ANTLR solution to a hand-written regular expression parser and solutions in PLY and PyParsing.
The regexp and PLY solutions were first developed in a two-part series on two ways of writing parsers in Python.
The regexp solution and the base ANTLR/PLY/PyParsing solutions use a regular expression like [A-Z][a-z]?\d* to match terms in the formula. This is what @David M suggested.
Here it is worked out in Python:
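The listing itself is missing from this copy of the answer. As a minimal sketch of the approach described above (not necessarily the author's original code; the names `ELEMENT_PATTERN` and `expand_formula` are made up for illustration), it could look like this:

```python
import re

# An element symbol is one capital letter plus an optional lower-case letter.
# The count is an optional run of digits; an empty count means 1.
ELEMENT_PATTERN = re.compile(r"([A-Z][a-z]?)(\d*)")


def expand_formula(formula):
    """Return a flat list with one entry per atom, in order of appearance."""
    atoms = []
    for symbol, count in ELEMENT_PATTERN.findall(formula):
        atoms.extend([symbol] * (int(count) if count else 1))
    return atoms


# Hard-coded to acetic acid, as in the answer.
print(expand_formula("CH3COOH"))
```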
When I run this (it's hard-coded to use acetic acid, CH3COOH) I get ['C', 'H', 'H', 'H', 'C', 'O', 'O', 'H']
Do note that this short bit of code assumes the molecular formula is correct. If you give it something like "##$%^O2#$$#" then it will ignore the fields it doesn't know about and give ['O', 'O']. If you don't want that then you'll have to make it a bit more robust.
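One way to make it stricter, as a sketch: check that the whole string consists of element terms before expanding it. The function name `expand_formula_strict` is hypothetical. This only checks the shape of the formula, so a made-up symbol such as "Xx" would still pass.

```python
import re

# Same term shape as before: capital letter, optional lower-case letter, optional digits.
TERM_PATTERN = re.compile(r"([A-Z][a-z]?)(\d*)")
# The whole formula must be one or more such terms and nothing else.
FORMULA_PATTERN = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def expand_formula_strict(formula):
    """Expand a simple formula, rejecting any input with unrecognised characters."""
    if not FORMULA_PATTERN.fullmatch(formula):
        raise ValueError(f"not a simple molecular formula: {formula!r}")
    atoms = []
    for symbol, count in TERM_PATTERN.findall(formula):
        atoms.extend([symbol] * (int(count) if count else 1))
    return atoms
```

With this version, "##$%^O2#$$#" raises a ValueError instead of silently producing two oxygen atoms.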
If you want to support more complicated formulas, like C6H2(NO2)3CH3, then you'll need to know a bit about tree data structures, specifically (as @Roman points out), abstract syntax trees (most often called ASTs). That's too complicated to get into here, so see my talk and essays for more details.
Assuming it's correctly capitalised, each symbol in the equation matches this regular expression:
(For the chemically challenged, an element's symbol is always a capital letter, optionally followed by one or possibly two lower-case letters, e.g. Hg for mercury.)
You can capture the element symbol and the number in groups like so:
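The patterns themselves are missing from this copy. A hedged sketch of the capture-group idea in Python follows (the question targets Java, where the same pattern syntax applies). The pattern is an assumption reconstructed from the description above: one capital letter, up to two lower-case letters, then optional digits. The function name `symbol_counts` is made up for illustration.

```python
import re

# Group 1 captures the element symbol, group 2 the optional count.
TERM = re.compile(r"([A-Z][a-z]{0,2})(\d*)")


def symbol_counts(formula):
    """Return (symbol, count) pairs in order; a missing count means 1."""
    pairs = []
    for match in TERM.finditer(formula):
        symbol = match.group(1)
        digits = match.group(2)
        pairs.append((symbol, int(digits) if digits else 1))
    return pairs


print(symbol_counts("CH3COOH"))
```

Keeping the count as a number rather than repeating the symbol makes it easy to build either a flat list or a per-element total afterwards.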
So yes, in theory this would be something regular expressions could help with. If you're dealing with formulae like C6H2(NO2)3(CH3)3 then your job is of course a bit harder...
The solution with regular expressions is the best approach if you only need to handle simple cases. Otherwise you need to build something like an abstract syntax tree and evaluate it, or use Polish notation.
For example, TNT formula
C6H2(NO2)3CH3
should be presented like:
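The representation the answer showed is missing from this copy. As one possible illustration only, here is a sketch of a small tree for TNT and a function that evaluates it into atom counts. The node kinds "seq" and "mul" and the function name `count_atoms` are made up for this example.

```python
from collections import Counter

# Hypothetical tree encoding:
#   an atom is a plain string,
#   ("mul", node, n) repeats a node n times,
#   ("seq", [nodes]) is a sequence of nodes.
TNT = ("seq", [
    ("mul", "C", 6),
    ("mul", "H", 2),
    ("mul", ("seq", ["N", ("mul", "O", 2)]), 3),
    "C",
    ("mul", "H", 3),
])


def count_atoms(node):
    """Evaluate a tree into a Counter mapping element symbol to atom count."""
    if isinstance(node, str):
        return Counter({node: 1})
    kind = node[0]
    if kind == "mul":
        _, child, times = node
        inner = count_atoms(child)
        return Counter({symbol: n * times for symbol, n in inner.items()})
    if kind == "seq":
        total = Counter()
        for child in node[1]:
            total += count_atoms(child)
        return total
    raise ValueError(f"unknown node kind: {kind!r}")


print(count_atoms(TNT))
```

The multiplier on the (NO2) group is applied to the whole subtree, which is the part a flat regular expression cannot express.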
Have you looked into expressing your chemical formulas in Chemical Markup Language? It is very versatile, and there are a lot of tools/viewers out there that can render these chemical formulas or compounds in 2D or 3D.
I am working on a program that requires molar mass calculations of chemical formulas, so I have created a solution that works for a variety of formulas.
For example, "(CH3)16(Tc(H2O)3CO(BrFe3(ReCl)3(SO4)2)2)2MnO4" will result in " 16C 48H 2Tc 12H 6O 2C 2O 4Br 12Fe 12Re 12Cl 8S 32O Mn 4O" (this compound is made up, but hey, it works!)
This code is written in C# so that's why I haven't posted it. If you're interested I can post it for you. I actually wrote out a full answer before noticing the java tag.
Anyway, it basically works by recursively grouping blocks of atoms enclosed in matching parentheses. It does not handle coefficients such as 2Pb (but (Pb)2 or Pb2 do work) or charged compounds such as OH-.
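The C# code is not posted here, so the following is only a sketch of the same idea in Python, not the author's implementation. All function names are hypothetical. Like the description above, it rejects leading coefficients such as 2Pb and charges such as OH-.

```python
def _read_count(formula, pos):
    """Read an optional run of digits starting at pos; default to 1."""
    end = pos
    while end < len(formula) and formula[end].isdigit():
        end += 1
    return (int(formula[pos:end]) if end > pos else 1), end


def _parse_block(formula, pos):
    """Parse terms until a ')' or the end of the string."""
    pairs = []
    while pos < len(formula) and formula[pos] != ")":
        if formula[pos] == "(":
            # Recurse into the group, then apply its multiplier to every atom inside.
            inner, pos = _parse_block(formula, pos + 1)
            if pos >= len(formula) or formula[pos] != ")":
                raise ValueError("unbalanced parentheses")
            count, pos = _read_count(formula, pos + 1)
            pairs.extend((symbol, n * count) for symbol, n in inner)
        elif formula[pos].isupper():
            end = pos + 1
            if end < len(formula) and formula[end].islower():
                end += 1
            symbol = formula[pos:end]
            count, pos = _read_count(formula, end)
            pairs.append((symbol, count))
        else:
            raise ValueError(f"unexpected {formula[pos]!r} at position {pos}")
    return pairs, pos


def parse_groups(formula):
    """Return ordered (symbol, count) pairs with group multipliers applied."""
    pairs, pos = _parse_block(formula, 0)
    if pos != len(formula):
        raise ValueError(f"unexpected {formula[pos]!r} at position {pos}")
    return pairs


def format_pairs(pairs):
    """Format pairs like ' 16C 48H Mn', omitting a count of 1."""
    return "".join(f" {n if n != 1 else ''}{symbol}" for symbol, n in pairs)


print(format_pairs(parse_groups("(CH3)16(Tc(H2O)3CO(BrFe3(ReCl)3(SO4)2)2)2MnO4")))
```

Each recursive call handles one parenthesised block, so nesting depth is not limited by the pattern, only by the input.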
In no way is it simple or elegant. I just wanted a working solution, so I know there are better ways (I never even tried regular expressions!). But it works with the formulas I need; maybe it suits yours as well.
Here are some test cases I run it on. Take a look at them and let me know if the C# code would still be useful to you. The format is (input, expected output)