Parsing a chemical formula from a string in C#?

Posted on 2024-09-30 19:44:01

I am trying to parse a chemical formula (for example Al2O3, O3, C, or C11H22O12) from a string in C#. It works fine unless there is only one atom of a particular element (e.g. the oxygen atom in H2O). How can I fix that problem, and in addition, is there a better way to parse a chemical formula string than what I am doing now?

ChemicalElement is a class representing a chemical element. It has properties AtomicNumber (int), Name (string), Symbol (string).
ChemicalFormulaComponent is a class representing a chemical element and atom count (e.g. part of a formula). It has properties Element (ChemicalElement), AtomCount (int).

The rest should be clear enough to understand (I hope) but please let me know with a comment if I can clarify anything, before you answer.

Here is my current code:

    /// <summary>
    /// Parses a chemical formula from a string.
    /// </summary>
    /// <param name="chemicalFormula">The string to parse.</param>
    /// <exception cref="FormatException">The chemical formula was in an invalid format.</exception>
    public static Collection<ChemicalFormulaComponent> FormulaFromString(string chemicalFormula)
    {
        Collection<ChemicalFormulaComponent> formula = new Collection<ChemicalFormulaComponent>();

        string nameBuffer = string.Empty;
        int countBuffer = 0;

        for (int i = 0; i < chemicalFormula.Length; i++)
        {
            char c = chemicalFormula[i];

            if (!char.IsLetterOrDigit(c) || !char.IsUpper(chemicalFormula, 0))
            {
                throw new FormatException("Input string was in an incorrect format.");
            }
            else if (char.IsUpper(c))
            {
                // Add the chemical element and its atom count
                if (countBuffer > 0)
                {
                    formula.Add(new ChemicalFormulaComponent(ChemicalElement.ElementFromSymbol(nameBuffer), countBuffer));

                    // Reset
                    nameBuffer = string.Empty;
                    countBuffer = 0;
                }

                nameBuffer += c;
            }
            else if (char.IsLower(c))
            {
                nameBuffer += c;
            }
            else if (char.IsDigit(c))
            {
                if (countBuffer == 0)
                {
                    countBuffer = c - '0';
                }
                else
                {
                    countBuffer = (countBuffer * 10) + (c - '0');
                }
            }
        }

        return formula;
    }

Comments (4)

征﹌骨岁月お 2024-10-07 19:44:01

I rewrote your parser using regular expressions. Regular expressions fit the bill perfectly for what you're doing. Hope this helps.

public static void Main(string[] args)
{
    var testCases = new List<string>
    {
        "C11H22O12",
        "Al2O3",
        "O3",
        "C",
        "H2O"
    };

    foreach (string testCase in testCases)
    {
        Console.WriteLine("Testing {0}", testCase);

        var formula = FormulaFromString(testCase);

        foreach (var element in formula)
        {
            // Uses the Symbol and AtomCount properties described in the question.
            Console.WriteLine("{0} : {1}", element.Element.Symbol, element.AtomCount);
        }
        Console.WriteLine();
    }

    /* Produced the following output

    Testing C11H22O12
    C : 11
    H : 22
    O : 12

    Testing Al2O3
    Al : 2
    O : 3

    Testing O3
    O : 3

    Testing C
    C : 1

    Testing H2O
    H : 2
    O : 1
        */
}

private static Collection<ChemicalFormulaComponent> FormulaFromString(string chemicalFormula)
{
    Collection<ChemicalFormulaComponent> formula = new Collection<ChemicalFormulaComponent>();
    string elementRegex = "([A-Z][a-z]*)([0-9]*)";
    string validateRegex = "^(" + elementRegex + ")+$";

    if (!Regex.IsMatch(chemicalFormula, validateRegex))
        throw new FormatException("Input string was in an incorrect format.");

    foreach (Match match in Regex.Matches(chemicalFormula, elementRegex))
    {
        string name = match.Groups[1].Value;

        int count =
            match.Groups[2].Value != "" ?
            int.Parse(match.Groups[2].Value) :
            1;

        formula.Add(new ChemicalFormulaComponent(ChemicalElement.ElementFromSymbol(name), count));
    }

    return formula;
}
兰花执着 2024-10-07 19:44:01

The problem with your method is here:

            // Add the chemical element and its atom count
            if (countBuffer > 0)

When an element has no number, countBuffer is still 0, so that element is never added. I think this will work:

            // Add the chemical element and its atom count
            if (countBuffer > 0 || nameBuffer != String.Empty)

This will work for formulas like HO2 or something like that.
I also believe that your method will never insert the last element of the chemical formula into the formula collection.

You should add the last element in the buffer to the collection before returning the result, like this:

    formula.Add(new ChemicalFormulaComponent(ChemicalElement.ElementFromSymbol(nameBuffer), countBuffer));

    return formula;
}
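
Putting both fixes together, the loop might look roughly like the sketch below. This is only an illustration under a few assumptions: `FormulaScanner` is a made-up name, and it returns plain (symbol, count) tuples instead of `ChemicalFormulaComponent` objects because those classes are not shown in the question. Note that an element without digits is stored with a count of 1 rather than 0, and an explicit count of 0 is treated the same as a missing count.

```csharp
using System;
using System.Collections.Generic;

public static class FormulaScanner
{
    // Hypothetical helper: yields (symbol, count) pairs rather than
    // ChemicalFormulaComponent objects, which are not defined in this sketch.
    public static List<(string Symbol, int Count)> Parse(string chemicalFormula)
    {
        if (string.IsNullOrEmpty(chemicalFormula) || !char.IsUpper(chemicalFormula[0]))
            throw new FormatException("Input string was in an incorrect format.");

        var formula = new List<(string Symbol, int Count)>();
        string nameBuffer = string.Empty;
        int countBuffer = 0;

        foreach (char c in chemicalFormula)
        {
            if (char.IsUpper(c))
            {
                // A new symbol starts: flush the previous one, if there is one.
                if (nameBuffer.Length > 0)
                    formula.Add((nameBuffer, countBuffer > 0 ? countBuffer : 1));

                nameBuffer = c.ToString();
                countBuffer = 0;
            }
            else if (char.IsLower(c) && countBuffer == 0)
            {
                // Lowercase letters may only extend a symbol before its digits.
                nameBuffer += c;
            }
            else if (char.IsDigit(c))
            {
                countBuffer = (countBuffer * 10) + (c - '0');
            }
            else
            {
                throw new FormatException("Input string was in an incorrect format.");
            }
        }

        // Flush the last element; no later uppercase letter will do it for us.
        formula.Add((nameBuffer, countBuffer > 0 ? countBuffer : 1));
        return formula;
    }
}
```

With this, `FormulaScanner.Parse("H2O")` should give H with count 2 followed by O with count 1, which is the case the original loop dropped.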
夜司空 2024-10-07 19:44:01

First of all: I haven't used a parser generator in .NET, but I'm pretty sure you could find something appropriate. This would allow you to write the grammar of chemical formulas in a far more readable form. See for example this question for a start.

If you want to keep your approach: is it possible that your last element is never added, whether or not it has a number? You might want to run your loop with i <= chemicalFormula.Length and, in the case of i == chemicalFormula.Length, also add what you have to your formula. You then also have to remove your if (countBuffer > 0) condition, because countBuffer can actually be zero!

在巴黎塔顶看东京樱花 2024-10-07 19:44:01

Regex should work fine for simple formulas. If you want to split something like:

(Zn2(Ca(BrO4))K(Pb)2Rb)3

it might be easier to use a parser for it (because of compound nesting). Any parser should be capable of handling it.

I spotted this problem a few days ago and thought it would be a good example of how one can write a grammar for a parser, so I included a simple chemical formula grammar in my NLT suite. The key rules for the lexer are:

"(" -> LPAREN;
")" -> RPAREN;

/[0-9]+/ -> NUM, Convert.ToInt32($text);
/[A-Z][a-z]*/ -> ATOM;

and for the parser:

comp -> e:elem { e };

elem -> LPAREN e:elem RPAREN n:NUM? { new Element(e,$(n : 1)) }
      | e:elem++ { new Element(e,1) }
      | a:ATOM n:NUM? { new Element(a,$(n : 1)) }
      ;
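
If pulling in a parser generator is not an option, the same grammar can be approximated by hand. The sketch below is not NLT and not the code that suite generates; it is a small recursive-descent parser written for illustration, and `NestedFormulaParser` is a made-up name. As a simplifying assumption it flattens the result into total atom counts per symbol instead of building a tree of `Element` objects.

```csharp
using System;
using System.Collections.Generic;

public static class NestedFormulaParser
{
    // Returns the total atom count per element symbol.
    public static Dictionary<string, int> Parse(string text)
    {
        int pos = 0;
        var counts = ParseGroup(text, ref pos);
        if (pos != text.Length) // ParseGroup stopped at an unmatched ')'
            throw new FormatException($"Unexpected '{text[pos]}' at position {pos}.");
        return counts;
    }

    // group := ( ATOM NUM? | '(' group ')' NUM? )+
    private static Dictionary<string, int> ParseGroup(string text, ref int pos)
    {
        var counts = new Dictionary<string, int>();
        int start = pos;

        while (pos < text.Length && text[pos] != ')')
        {
            Dictionary<string, int> part;
            if (text[pos] == '(')
            {
                pos++; // consume '('
                part = ParseGroup(text, ref pos);
                if (pos >= text.Length || text[pos] != ')')
                    throw new FormatException("Missing ')'.");
                pos++; // consume ')'
            }
            else if (text[pos] >= 'A' && text[pos] <= 'Z')
            {
                // ATOM: one uppercase letter followed by lowercase letters.
                int symbolStart = pos++;
                while (pos < text.Length && text[pos] >= 'a' && text[pos] <= 'z') pos++;
                part = new Dictionary<string, int>
                {
                    [text.Substring(symbolStart, pos - symbolStart)] = 1
                };
            }
            else
            {
                throw new FormatException($"Unexpected '{text[pos]}' at position {pos}.");
            }

            // Apply the optional multiplier and merge into this group's totals.
            int multiplier = ParseNumber(text, ref pos);
            foreach (var pair in part)
            {
                counts.TryGetValue(pair.Key, out int current);
                counts[pair.Key] = current + pair.Value * multiplier;
            }
        }

        if (pos == start)
            throw new FormatException("Empty formula or group.");
        return counts;
    }

    // NUM? : an optional run of digits; a missing number means 1.
    private static int ParseNumber(string text, ref int pos)
    {
        int start = pos;
        while (pos < text.Length && text[pos] >= '0' && text[pos] <= '9') pos++;
        return pos > start ? int.Parse(text.Substring(start, pos - start)) : 1;
    }
}
```

For the nested example above, `NestedFormulaParser.Parse("(Zn2(Ca(BrO4))K(Pb)2Rb)3")` should yield Zn 6, Ca 3, Br 3, O 12, K 3, Pb 6 and Rb 3, since every inner count is multiplied by the trailing 3.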