How do I write a parser in C#?

Posted 2024-12-03 20:50:31

Comments (7)

断爱 2024-12-10 20:50:31

I have implemented several parsers in C# - hand-written and tool generated.

A very good introductory tutorial on parsing in general is Let's Build a Compiler - it demonstrates how to build a recursive descent parser; and the concepts are easily translated from his language (I think it was Pascal) to C# for any competent developer. This will teach you how a recursive descent parser works, but it is completely impractical to write a full programming language parser by hand.
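
To make the recursive descent idea concrete, here is a minimal sketch in C#. It is not taken from the tutorial; the grammar, the class name `ArithmeticParser`, and the choice to evaluate directly instead of building a tree are all assumptions made for illustration. Each grammar rule becomes one method, and precedence falls out of which method calls which.

```csharp
using System;
using System.Globalization;

// Hypothetical grammar, chosen for illustration only:
//   expr   := term   (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := number | '(' expr ')'
public sealed class ArithmeticParser
{
    private readonly string _text;
    private int _pos;

    private ArithmeticParser(string text) { _text = text; }

    public static decimal Evaluate(string text)
    {
        var parser = new ArithmeticParser(text);
        decimal value = parser.ParseExpr();
        parser.SkipWhitespace();
        if (parser._pos != parser._text.Length)
            throw new FormatException($"Unexpected input at position {parser._pos}.");
        return value;
    }

    // expr: lowest precedence, so it delegates to term first.
    private decimal ParseExpr()
    {
        decimal left = ParseTerm();
        while (true)
        {
            if (TryConsume('+')) left += ParseTerm();
            else if (TryConsume('-')) left -= ParseTerm();
            else return left;
        }
    }

    // term: binds tighter than expr, looser than factor.
    private decimal ParseTerm()
    {
        decimal left = ParseFactor();
        while (true)
        {
            if (TryConsume('*')) left *= ParseFactor();
            else if (TryConsume('/')) left /= ParseFactor();
            else return left;
        }
    }

    // factor: a number, or a parenthesised expr (this is where the recursion happens).
    private decimal ParseFactor()
    {
        if (TryConsume('('))
        {
            decimal inner = ParseExpr();
            if (!TryConsume(')'))
                throw new FormatException($"Expected ')' at position {_pos}.");
            return inner;
        }
        SkipWhitespace();
        int start = _pos;
        while (_pos < _text.Length &&
               ((_text[_pos] >= '0' && _text[_pos] <= '9') || _text[_pos] == '.'))
            _pos++;
        if (start == _pos)
            throw new FormatException($"Expected a number at position {_pos}.");
        return decimal.Parse(_text.Substring(start, _pos - start), CultureInfo.InvariantCulture);
    }

    private bool TryConsume(char expected)
    {
        SkipWhitespace();
        if (_pos < _text.Length && _text[_pos] == expected) { _pos++; return true; }
        return false;
    }

    private void SkipWhitespace()
    {
        while (_pos < _text.Length && char.IsWhiteSpace(_text[_pos])) _pos++;
    }
}
```

Calling `ArithmeticParser.Evaluate("2+5*3")` walks expr, then term, then factor, so the multiplication is grouped before the addition.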

If you are determined to write a classical recursive descent parser, you should look into tools that generate the code for you (TinyPG, Coco/R, Irony). Keep in mind that there are now other ways to write parsers that usually perform better and have simpler definitions (e.g. TDOP parsing or monadic parsing).
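
For a feel of what TDOP (top-down operator precedence, also known as Pratt parsing) looks like, here is a rough sketch. The class name `PrattCalculator`, the binding-power numbers, and the integer-only arithmetic are assumptions for illustration, not any library's API. Precedence lives in a table instead of in one grammar rule per level.

```csharp
using System;
using System.Collections.Generic;

public sealed class PrattCalculator
{
    // Binding power per infix operator: a higher number binds tighter.
    // The numbers are arbitrary; only their relative order matters.
    private static readonly Dictionary<char, int> BindingPower = new Dictionary<char, int>
    {
        ['+'] = 10, ['-'] = 10, ['*'] = 20, ['/'] = 20,
    };

    private readonly string _text;
    private int _pos;

    private PrattCalculator(string text) { _text = text; }

    public static long Evaluate(string text)
    {
        var parser = new PrattCalculator(text);
        long value = parser.ParseExpression(0);
        if (parser.Peek() != '\0')
            throw new FormatException($"Unexpected input at position {parser._pos}.");
        return value;
    }

    // Parses operands joined by operators that bind tighter than minPower.
    private long ParseExpression(int minPower)
    {
        long left = ParsePrefix();
        while (true)
        {
            char op = Peek();
            if (!BindingPower.TryGetValue(op, out int power) || power <= minPower)
                return left;
            _pos++; // consume the operator
            // Passing the same power for the right side makes operators left-associative.
            long right = ParseExpression(power);
            left = Apply(op, left, right);
        }
    }

    // Handles whatever can start an expression: '(', unary minus, or a number.
    private long ParsePrefix()
    {
        char c = Peek();
        if (c == '(')
        {
            _pos++;
            long inner = ParseExpression(0);
            if (Peek() != ')')
                throw new FormatException($"Expected ')' at position {_pos}.");
            _pos++;
            return inner;
        }
        if (c == '-')
        {
            _pos++;
            return -ParseExpression(30); // prefix minus binds tighter than any infix operator
        }
        int start = _pos;
        while (_pos < _text.Length && _text[_pos] >= '0' && _text[_pos] <= '9') _pos++;
        if (start == _pos)
            throw new FormatException($"Expected a number at position {_pos}.");
        return long.Parse(_text.Substring(start, _pos - start));
    }

    private static long Apply(char op, long left, long right)
    {
        switch (op)
        {
            case '+': return left + right;
            case '-': return left - right;
            case '*': return left * right;
            default: return left / right; // '/' is the only operator left; integer division
        }
    }

    // Skips whitespace and returns the next character, or '\0' at end of input.
    private char Peek()
    {
        while (_pos < _text.Length && char.IsWhiteSpace(_text[_pos])) _pos++;
        return _pos < _text.Length ? _text[_pos] : '\0';
    }
}
```

Adding an operator in this style means adding one table entry and one `Apply` case, which is the "easier definitions" point in practice.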

On the topic of whether C# is up for the task - C# has some of the best text libraries out there. A lot of the parsers today (in other languages) have an obscene amount of code to deal with Unicode etc. I won't comment too much on JITted code because it can get quite religious - however, you should be just fine. IronJS is a good example of a parser/runtime on the CLR (even though it's written in F#) and its performance is just shy of Google V8.

Side Note: Markup parsers are completely different beasts when compared to language parsers - they are, in the majority of the cases, written by hand - and at the scanner/parser level very simple; they are not usually recursive descent - and especially in the case of XML it is better if you don't write a recursive descent parser (to avoid stack overflows, and because a 'flat' parser can be used in SAX/push mode).

叶落知秋 2024-12-10 20:50:31

Sprache is a powerful yet lightweight framework for writing parsers in .NET. There is also a Sprache NuGet package. To give you an idea of the framework, here is one of the samples, which parses a simple arithmetic expression into a .NET expression tree. Pretty amazing, I would say.

using System;
using System.Linq.Expressions;
using Sprache;

namespace LinqyCalculator
{
    static class ExpressionParser
    {
        public static Expression<Func<decimal>> ParseExpression(string text)
        {
            return Lambda.Parse(text);
        }

        static Parser<ExpressionType> Operator(string op, ExpressionType opType)
        {
            return Parse.String(op).Token().Return(opType);
        }

        static readonly Parser<ExpressionType> Add = Operator("+", ExpressionType.AddChecked);
        static readonly Parser<ExpressionType> Subtract = Operator("-", ExpressionType.SubtractChecked);
        static readonly Parser<ExpressionType> Multiply = Operator("*", ExpressionType.MultiplyChecked);
        static readonly Parser<ExpressionType> Divide = Operator("/", ExpressionType.Divide);

        static readonly Parser<Expression> Constant =
            (from d in Parse.Decimal.Token()
             select (Expression)Expression.Constant(decimal.Parse(d))).Named("number");

        static readonly Parser<Expression> Factor =
            ((from lparen in Parse.Char('(')
              from expr in Parse.Ref(() => Expr)
              from rparen in Parse.Char(')')
              select expr).Named("expression")
             .XOr(Constant)).Token();

        static readonly Parser<Expression> Term = Parse.ChainOperator(Multiply.Or(Divide), Factor, Expression.MakeBinary);

        static readonly Parser<Expression> Expr = Parse.ChainOperator(Add.Or(Subtract), Term, Expression.MakeBinary);

        static readonly Parser<Expression<Func<decimal>>> Lambda =
            Expr.End().Select(body => Expression.Lambda<Func<decimal>>(body));
    }
}

愛放△進行李 2024-12-10 20:50:31

C# is almost a decent functional language, so it is not such a big deal to implement something like Parsec in it. Here is one of the examples of how to do it: http://jparsec.codehaus.org/NParsec+Tutorial

It is also possible to implement a combinator-based Packrat parser in a very similar way, but this time keeping a global parsing state somewhere instead of staying purely functional. In my (very basic and ad hoc) implementation it was reasonably fast, but of course a code generator like this must perform better.
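
As a rough idea of how little is needed for the combinator approach, here is a sketch. This is not NParsec's API; `Result<T>`, `Parser<T>`, `Combinators`, and `SumDemo` are hypothetical names made up for this example. Defining `Select` and `SelectMany` is what lets C# query syntax play the role of monadic sequencing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A successful parse: the value produced and the position just after it.
public sealed class Result<T>
{
    public T Value { get; }
    public int Next { get; }
    public Result(T value, int next) { Value = value; Next = next; }
}

// A parser reads input starting at pos and returns a Result, or null on failure.
public delegate Result<T> Parser<T>(string input, int pos);

public static class Combinators
{
    // Matches a single character that satisfies the predicate.
    public static Parser<char> Char(Func<char, bool> predicate) =>
        (input, pos) => pos < input.Length && predicate(input[pos])
            ? new Result<char>(input[pos], pos + 1)
            : null;

    // Applies the parser zero or more times and collects the values.
    public static Parser<List<T>> Many<T>(this Parser<T> parser) =>
        (input, pos) =>
        {
            var items = new List<T>();
            Result<T> r;
            while ((r = parser(input, pos)) != null && r.Next > pos)
            {
                items.Add(r.Value);
                pos = r.Next;
            }
            return new Result<List<T>>(items, pos);
        };

    // Tries the first parser; falls back to the second at the same position.
    public static Parser<T> Or<T>(this Parser<T> first, Parser<T> second) =>
        (input, pos) => first(input, pos) ?? second(input, pos);

    public static Parser<U> Select<T, U>(this Parser<T> parser, Func<T, U> map) =>
        (input, pos) =>
        {
            var r = parser(input, pos);
            return r == null ? null : new Result<U>(map(r.Value), r.Next);
        };

    // Runs one parser, then a second one that may depend on the first value.
    public static Parser<V> SelectMany<T, U, V>(
        this Parser<T> parser, Func<T, Parser<U>> bind, Func<T, U, V> project) =>
        (input, pos) =>
        {
            var first = parser(input, pos);
            if (first == null) return null;
            var second = bind(first.Value)(input, first.Next);
            return second == null
                ? null
                : new Result<V>(project(first.Value, second.Value), second.Next);
        };
}

public static class SumDemo
{
    // One or more digits, converted to an int.
    public static readonly Parser<int> Number =
        from first in Combinators.Char(char.IsDigit)
        from rest in Combinators.Char(char.IsDigit).Many()
        select int.Parse(first + new string(rest.ToArray()));

    // number ('+' number)*, evaluated to the total.
    public static readonly Parser<int> Addition =
        from head in Number
        from tail in (from plus in Combinators.Char(c => c == '+')
                      from n in Number
                      select n).Many()
        select head + tail.Sum();

    public static int ParseSum(string text)
    {
        var r = Addition(text, 0);
        if (r == null || r.Next != text.Length)
            throw new FormatException("Not a sum of integers.");
        return r.Value;
    }
}
```

A packrat variant would, roughly, cache each parser's result per input position so that backtracking never re-parses the same span; that cache is the shared state mentioned above.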

岛歌少女 2024-12-10 20:50:31

I know that I am a little late, but I just published a parser/grammar/AST generator library named Ve Parser. You can find it at http://veparser.codeplex.com or add it to your project by typing 'Install-Package veparser' in the Package Manager Console. This library is a kind of recursive descent parser that is intended to be easy to use and flexible. As its source is available to you, you can learn from its source code. I hope it helps.

记忆之渊 2024-12-10 20:50:31

In my opinion, there is a better way to implement parsers than the traditional methods that results in simpler and easier to understand code, and especially makes it easier to extend whatever language you are parsing by just plugging in a new class in a very object-oriented way. One article of a larger series that I wrote focuses on this parsing method, and full source code is included for a C# 2.0 parser:
http://www.codeproject.com/Articles/492466/Object-Oriented-Parsing-Breaking-With-Tradition-Pa

还在原地等你 2024-12-10 20:50:31

Well... where to start with this one....

First off, "writing a parser" is a very broad statement, especially with the question you're asking.

Your opening statement was that you wanted a simple arithmetic "parser". Well, technically that's not a parser, it's a lexical analyzer, similar to what you may use for creating a new language ( http://en.wikipedia.org/wiki/Lexical_analysis ). I understand exactly where the confusion of them being the same thing may come from, however. It's important to note that lexical analysis is ALSO what you'll want to understand if you're going to write language/script parsers too; this is strictly not parsing, because you are interpreting the instructions as opposed to making use of them.

Back to the parsing question....

This is what you'll be doing if you're taking a rigidly defined file structure to extract information from it.

In general you really don't have to write a parser for XML / HTML, because there are already a ton of them around, and more so if you're parsing XML produced by the .NET runtime, then you don't even need to parse, you just need to "serialise" and "de-serialise".

In the interests of learning, however, parsing XML (or anything similar, like HTML) is very straightforward in most cases.

If we start with the following XML:

    <movies>
      <movie id="1">
        <name>Tron</name>
      </movie>
      <movie id="2">
        <name>Tron Legacy</name>
      </movie>
    </movies>

we can load the data into an XDocument (from System.Xml.Linq) as follows:

    XDocument myXML = XDocument.Load("mymovies.xml");

You can then get at the 'movies' root element using 'myXML.Root'.

More interesting, however, is that you can easily use LINQ to get the nested tags:

    var myElements = from p in myXML.Root.Elements("movie")
                     select p;

This will give you a sequence of XElements, each containing one '...' movie element, which you can get at using something like:

    foreach (var v in myElements)
    {
      // Attribute() and Element() return null when missing; the casts read their values.
      Console.WriteLine(string.Format("ID {0} = {1}", (int)v.Attribute("id"), (string)v.Element("name")));
    }

For anything other than XML-like data structures, I'm afraid you're going to have to start learning the art of regular expressions. A tool like "Regular Expression Coach" ( http://weitz.de/regex-coach/ ), or one of the more up-to-date similar tools, will help you immensely.

You'll also need to become familiar with the .NET regular expression objects; ( http://www.codeproject.com/KB/dotnet/regextutorial.aspx ) should give you a good head start.

Once you know how your regex stuff works, in most cases it's a simple case of reading in the files one line at a time and making sense of them using whichever method you feel comfortable with.
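
As a small illustration of that line-at-a-time approach, here is a sketch using System.Text.RegularExpressions. The "key = value" file format and the `KeyValueFile` class are made up for this example; a real format will need its own pattern.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class KeyValueFile
{
    // Matches lines such as "name = Tron". The format is a hypothetical example.
    private static readonly Regex Line =
        new Regex(@"^\s*(?<key>\w+)\s*=\s*(?<value>.*?)\s*$");

    public static Dictionary<string, string> Parse(TextReader reader)
    {
        var result = new Dictionary<string, string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            Match m = Line.Match(line);
            if (m.Success) // lines that do not look like key = value are skipped
                result[m.Groups["key"].Value] = m.Groups["value"].Value;
        }
        return result;
    }
}
```

Taking a `TextReader` means the same method works for a file (`File.OpenText(path)`) or for an in-memory string (`new StringReader(text)`).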

A good free source of file formats for almost anything you can imagine can be found at ( http://www.wotsit.org/ )

小苏打饼 2024-12-10 20:50:31

For the record, I implemented a parser generator in C# just because I couldn't find any that worked properly or was similar to YACC (see: http://sourceforge.net/projects/naivelangtools/).

However, after some experience with ANTLR I decided to go with LALR instead of LL. I know that theoretically LL is easier to implement (generator or parser), but I simply cannot live with a stack of expression rules just to express operator priorities (like * goes before + in "2+5*3"). In LL you say that mult_expr is embedded inside add_expr, which does not seem natural to me.
