Parsing a chemical formula from a string in C#?
I am trying to parse a chemical formula (in the format, for example: Al2O3 or O3 or C or C11H22O12) in C# from a string. It works fine unless there is only one atom of a particular element (e.g. the oxygen atom in H2O). How can I fix that problem, and in addition, is there a better way to parse a chemical formula string than I am doing?
ChemicalElement is a class representing a chemical element. It has properties AtomicNumber (int), Name (string), Symbol (string).
ChemicalFormulaComponent is a class representing a chemical element and atom count (e.g. part of a formula). It has properties Element (ChemicalElement), AtomCount (int).
The rest should be clear enough to understand (I hope) but please let me know with a comment if I can clarify anything, before you answer.
Here is my current code:
/// <summary>
/// Parses a chemical formula from a string.
/// </summary>
/// <param name="chemicalFormula">The string to parse.</param>
/// <exception cref="FormatException">The chemical formula was in an invalid format.</exception>
public static Collection<ChemicalFormulaComponent> FormulaFromString(string chemicalFormula)
{
    Collection<ChemicalFormulaComponent> formula = new Collection<ChemicalFormulaComponent>();
    string nameBuffer = string.Empty;
    int countBuffer = 0;
    for (int i = 0; i < chemicalFormula.Length; i++)
    {
        char c = chemicalFormula[i];
        if (!char.IsLetterOrDigit(c) || !char.IsUpper(chemicalFormula, 0))
        {
            throw new FormatException("Input string was in an incorrect format.");
        }
        else if (char.IsUpper(c))
        {
            // Add the chemical element and its atom count
            if (countBuffer > 0)
            {
                formula.Add(new ChemicalFormulaComponent(ChemicalElement.ElementFromSymbol(nameBuffer), countBuffer));
                // Reset
                nameBuffer = string.Empty;
                countBuffer = 0;
            }
            nameBuffer += c;
        }
        else if (char.IsLower(c))
        {
            nameBuffer += c;
        }
        else if (char.IsDigit(c))
        {
            if (countBuffer == 0)
            {
                countBuffer = c - '0';
            }
            else
            {
                countBuffer = (countBuffer * 10) + (c - '0');
            }
        }
    }
    return formula;
}
I rewrote your parser using regular expressions. Regular expressions fit the bill perfectly for what you're doing. Hope this helps.
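As a rough illustration of the regular-expression approach (a minimal sketch, not the answerer's original code): the class name RegexFormulaParser and the use of KeyValuePair<string, int> in place of ChemicalFormulaComponent are assumptions made so that the example stands alone. Each pair could be turned into a ChemicalFormulaComponent via ChemicalElement.ElementFromSymbol.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the question's code base.
public static class RegexFormulaParser
{
    // One element symbol (uppercase letter plus optional lowercase letters)
    // followed by an optional run of ASCII digits.
    private static readonly Regex Component = new Regex(@"([A-Z][a-z]*)([0-9]*)");

    // Whole-string check so that input such as "h2o" or "H2-O" is rejected.
    private static readonly Regex WholeFormula = new Regex(@"\A(?:[A-Z][a-z]*[0-9]*)+\z");

    public static List<KeyValuePair<string, int>> Parse(string chemicalFormula)
    {
        if (chemicalFormula == null || !WholeFormula.IsMatch(chemicalFormula))
        {
            throw new FormatException("Input string was in an incorrect format.");
        }

        var components = new List<KeyValuePair<string, int>>();
        foreach (Match match in Component.Matches(chemicalFormula))
        {
            string symbol = match.Groups[1].Value;
            string digits = match.Groups[2].Value;

            // A missing count means exactly one atom, as for the O in H2O.
            int count = 1;
            if (digits.Length > 0 &&
                !int.TryParse(digits, NumberStyles.None, CultureInfo.InvariantCulture, out count))
            {
                // Reached only when the digits do not fit in an int.
                throw new FormatException("Atom count is too large.");
            }
            components.Add(new KeyValuePair<string, int>(symbol, count));
        }
        return components;
    }
}
```

Calling RegexFormulaParser.Parse("H2O") yields the pairs (H, 2) and (O, 1). The sketch only checks the shape of each symbol, so whether a symbol names a real element is left to the lookup step.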
The problem with your method is here:
When you don't have a number, the count buffer will be 0. I think this will work:
This will work for formulas like HO2 or something like that.
I believe that your method will never insert the last element of the chemical formula into the formula collection. You should add the last element of the buffer to the collection before returning the result, like this:
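A minimal sketch of the two fixes described above (treat a missing count as 1, and add the last buffered element after the loop). It is not the answerer's original code: the class name LoopFormulaParser, the hasCount flag, and the KeyValuePair<string, int> result type are assumptions made to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-alone variant of the question's character loop.
public static class LoopFormulaParser
{
    public static List<KeyValuePair<string, int>> Parse(string chemicalFormula)
    {
        if (string.IsNullOrEmpty(chemicalFormula) || !char.IsUpper(chemicalFormula[0]))
        {
            throw new FormatException("Input string was in an incorrect format.");
        }

        var formula = new List<KeyValuePair<string, int>>();
        string nameBuffer = string.Empty;
        int countBuffer = 0;
        bool hasCount = false; // true once a digit was seen for the current element

        foreach (char c in chemicalFormula)
        {
            if (char.IsUpper(c))
            {
                // A new symbol starts: flush the previous element whether or
                // not it had an explicit count (no digits means one atom).
                if (nameBuffer.Length > 0)
                {
                    formula.Add(new KeyValuePair<string, int>(nameBuffer, hasCount ? countBuffer : 1));
                }
                nameBuffer = c.ToString();
                countBuffer = 0;
                hasCount = false;
            }
            else if (char.IsLower(c) && !hasCount)
            {
                nameBuffer += c;
            }
            else if (c >= '0' && c <= '9')
            {
                // checked: throws OverflowException instead of wrapping around.
                countBuffer = checked(countBuffer * 10 + (c - '0'));
                hasCount = true;
            }
            else
            {
                throw new FormatException("Input string was in an incorrect format.");
            }
        }

        // The last element is still in the buffers; add it before returning.
        formula.Add(new KeyValuePair<string, int>(nameBuffer, hasCount ? countBuffer : 1));
        return formula;
    }
}
```

With this version, "H2O" produces (H, 2) and (O, 1), and a single-element formula such as "C" produces (C, 1).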
First of all: I haven't used a parser generator in .NET, but I'm pretty sure you could find something appropriate. This would allow you to write the grammar of chemical formulas in a far more readable form. See for example this question for a first start.
If you want to keep your approach: is it possible that you never add your last element, whether or not it has a number? You might want to run your loop with i <= chemicalFormula.Length and, in the case of i == chemicalFormula.Length, also add what you have to your formula. You then also have to remove your if (countBuffer > 0) condition, because countBuffer can actually be zero! A zero count then stands for a single atom.
Regex should work fine with simple formulas. If you want to split something like:
it might be easier to use a parser for it (because of compound nesting). Any parser should be capable of handling it.
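To illustrate why nesting favours a parser, here is a small hand-written recursive-descent sketch. It is an independent example, not the grammar from this answer: the class name NestedFormulaParser, the use of parentheses for groups (as in Ca(OH)2, an example chosen here), and the flattened Dictionary<string, int> result are all assumptions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical recursive-descent parser; returns total atoms per symbol.
public sealed class NestedFormulaParser
{
    private readonly string text;
    private int pos;

    private NestedFormulaParser(string text) { this.text = text; }

    public static Dictionary<string, int> Parse(string chemicalFormula)
    {
        if (string.IsNullOrEmpty(chemicalFormula))
        {
            throw new FormatException("Input string was in an incorrect format.");
        }
        var parser = new NestedFormulaParser(chemicalFormula);
        Dictionary<string, int> counts = parser.ParseGroup();
        if (parser.pos != chemicalFormula.Length) // e.g. an unmatched ')'
        {
            throw new FormatException("Input string was in an incorrect format.");
        }
        return counts;
    }

    // group := ( symbol | '(' group ')' ) count?   (one or more times)
    private Dictionary<string, int> ParseGroup()
    {
        var counts = new Dictionary<string, int>();
        while (pos < text.Length && text[pos] != ')')
        {
            Dictionary<string, int> part;
            if (text[pos] == '(')
            {
                pos++; // consume '('
                part = ParseGroup();
                if (pos >= text.Length || text[pos] != ')')
                {
                    throw new FormatException("Missing ')'.");
                }
                pos++; // consume ')'
            }
            else
            {
                part = new Dictionary<string, int> { { ParseSymbol(), 1 } };
            }

            int multiplier = ParseCount();
            foreach (KeyValuePair<string, int> entry in part)
            {
                int existing;
                counts.TryGetValue(entry.Key, out existing); // 0 when absent
                counts[entry.Key] = checked(existing + entry.Value * multiplier);
            }
        }
        if (counts.Count == 0)
        {
            throw new FormatException("Empty formula or group.");
        }
        return counts;
    }

    // symbol := uppercase letter followed by zero or more lowercase letters
    private string ParseSymbol()
    {
        int start = pos;
        if (text[pos] < 'A' || text[pos] > 'Z')
        {
            throw new FormatException("Expected an element symbol.");
        }
        pos++;
        while (pos < text.Length && text[pos] >= 'a' && text[pos] <= 'z') pos++;
        return text.Substring(start, pos - start);
    }

    // count := optional run of digits; a missing count means 1
    private int ParseCount()
    {
        int start = pos;
        int value = 0;
        while (pos < text.Length && text[pos] >= '0' && text[pos] <= '9')
        {
            value = checked(value * 10 + (text[pos] - '0'));
            pos++;
        }
        return pos == start ? 1 : value;
    }
}
```

For example, NestedFormulaParser.Parse("Ca(OH)2") gives Ca = 1, O = 2, H = 2. The recursion in ParseGroup is what a single flat regular expression cannot express for arbitrarily deep nesting.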
I spotted this problem a few days ago and thought it would be a good example of how one can write a grammar for a parser, so I included a simple chemical formula grammar in my NLT suite. The key rules are, for the lexer:
and for the parser: