哪些编程语言是上下文无关的?
或者,更准确地说:哪些编程语言是由上下文无关语法定义的?
据我所知,由于宏和模板之类的原因,C++ 并不是上下文无关的。 我的直觉告诉我,函数式语言可能是上下文无关的,但我没有任何硬数据来支持这一点。
简洁示例的额外代表:-)
Or, to be a little more precise: which programming languages are defined by a context-free grammar?
From what I gather C++ is not context-free due to things like macros and templates. My gut tells me that functional languages might be context free, but I don't have any hard data to back that up with.
Extra rep for concise examples :-)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(9)
简短版本:现实世界中几乎没有任何编程语言是上下文无关的这个词的任何含义。 一种语言是否与上下文无关与其功能性无关。 这只是语法复杂程度的问题。
这是命令式语言 Brainfuck 的 CFG:
这是函数 SKI 组合器演算:
这些 CFG 识别所有< /em> 两种语言的有效程序,因为它们非常简单。
较长的版本:通常,上下文无关语法(CFG)仅用于粗略指定语言的语法。 人们必须区分语法正确的程序和正确编译/评估的程序。 最常见的是,编译器将语言分析分为构建和验证一段代码的一般结构的语法分析和验证含义的语义分析> 的程序。
如果“上下文无关语言”的意思是“...所有程序都为其编译”,那么答案是:几乎没有。 符合这一要求的语言几乎没有任何规则或复杂的功能,例如变量的存在、空白敏感性、类型系统或任何其他上下文:在一个地方定义并在另一个地方依赖的信息。
另一方面,如果“上下文无关语言”仅意味着“...所有程序都通过语法分析”,那么答案就在于语法本身有多复杂。 有许多句法特征很难或不可能仅用 CFG 来描述。 其中一些问题可以通过向解析器添加额外的状态来跟踪计数器、查找表等来克服。
无法使用 CFG 表达的语法特征示例:
对缩进和空格敏感的语言,例如 Python 和 Haskell。 跟踪任意嵌套的缩进级别本质上是上下文相关的,并且需要单独的缩进级别计数器; 每个级别使用多少空间以及有多少级别。
通过为每个缩进级别重复语法,仅允许使用固定数量的空格进行固定级别的缩进,但实际上这很不方便。
C Typedef 解析问题 说 C程序在词法分析期间是不明确的,因为它无法仅从语法中知道某物是否是现有类型的常规标识符或 typedef 别名。
示例是:
<块引用>
<块引用>
在分号处,需要使用 my_int 条目更新类型环境。 但是,如果词法分析器已经预先查找 my_int,它会将其词法分析为标识符而不是类型名称。
在上下文无关语法术语中,在
my_int
上触发的X → ...
规则是不明确的:它可以是生成标识符的规则,也可以是生成标识符的规则。一个产生typedef
'ed 类型的类型; 知道哪一个依赖于语法本身之外的查找表(上下文)。基于宏和模板的语言,例如 Lisp、C++、Template Haskell、Nim 等。 由于语法在解析时会发生变化,因此一种解决方案是将解析器变成一个自修改程序。 另请参阅 C++ 是上下文无关的还是上下文敏感的?
通常,运算符优先级和结合性不会直接在 CFG 中表达,尽管它是可能的。 例如,小表达式语法的 CFG(其中 ^ 比 × 绑定更紧密,× 比 + 绑定更紧密)可能如下所示:
<前><代码> E → E ^ E
E→E×E
E → E + E
E → (E)
E → 数字
但是,此 CFG 不明确,并且通常伴随有 优先级/结合性表 表示 ^ 结合最紧密,× 比 + 结合更紧密,^ 是右结合,并且 × 和 + 是左关联的。
优先级和关联性可以以机械方式编码到 CFG 中,这样它就不会产生歧义,并且仅在运算符行为正确的情况下生成语法树。 上述语法的示例:
<前><代码> E₀ → EA E₁
EA → E₁ + EA
EA → ε
E₁ → EM E2
EM → E2 × EM
EM → ε
E2 → E₃ EP
EP → ^ E₃ EP
E₃ → 数字
E₃ → (E₀)
但是不明确的 CFG + 优先级/关联性表很常见,因为它们更具可读性,并且因为各种类型的 LR解析器生成器库可以通过 消除移位/归约冲突,而不是处理更大尺寸的明确的、转换的语法。
理论上,所有有限的字符串集都是正则语言,因此所有有界大小的合法程序都是正则的。 由于常规语言是上下文无关语言的子集,因此所有有限大小的程序都是上下文无关的。 争论还在继续,
CFG 也是如此。 为了以不同的方式解决你的子问题,
大多数现实世界的编程语言都是由其实现来定义的,并且大多数现实世界编程语言的解析器要么是手写的,要么使用扩展上下文无关解析的解析器生成器。 遗憾的是,为您喜欢的语言找到准确的 CFG 并不常见。 当您这样做时,通常采用 Backus-Naur 形式 (BNF),或者很可能不是纯粹上下文无关的解析器规范。
来自野外的语法规范示例:
The short version: There are hardly any real-world programming languages that are context-free in any meaning of the word. Whether a language is context-free or not has nothing to do with it being functional. It is simply a matter of how complex the syntax is.
Here's a CFG for the imperative language Brainfuck:
And here's a CFG for the functional SKI combinator calculus:
These CFGs recognize all valid programs of the two languages because they're so simple.
The longer version: Usually, context-free grammars (CFGs) are only used to roughly specify the syntax of a language. One must distinguish between syntactically correct programs and programs that compile/evaluate correctly. Most commonly, compilers split language analysis into syntax analysis that builds and verifies the general structure of a piece of code, and semantic analysis that verifies the meaning of the program.
If by "context-free language" you mean "... for which all programs compile", then the answer is: hardly any. Languages that fit this bill hardly have any rules or complicated features, like the existence of variables, whitespace-sensitivity, a type system, or any other context: Information defined in one place and relied upon in another.
If, on the other hand, "context-free language" only means "... for which all programs pass syntax analysis", the answer is a matter of how complex the syntax alone is. There are many syntactic features that are hard or impossible to describe with a CFG alone. Some of these are overcome by adding additional state to parsers for keeping track of counters, lookup tables, and so on.
Examples of syntactic features that are not possible to express with a CFG:
Indentation- and whitespace-sensitive languages like Python and Haskell. Keeping track of arbitrarily nested indentation levels is essentially context-sensitive and requires separate counters for the indentation level; both how many spaces that are used for each level and how many levels there are.
Allowing only a fixed level of indentation using a fixed amount of spaces would work by duplicating the grammar for each level of indentation, but in practice this is inconvenient.
The C Typedef Parsing Problem says that C programs are ambiguous during lexical analysis because it cannot know from the grammar alone if something is a regular identifier or a typedef alias for an existing type.
The example is:
In context-free grammar terms, the
X → ...
rule that would trigger onmy_int
is ambiguous: It could be either one that produces an identifier, or one that produces atypedef
'ed type; knowing which one relies on a lookup table (context) beyond the grammar itself.Macro- and template-based languages like Lisp, C++, Template Haskell, Nim, and so on. Since the syntax changes as it is being parsed, one solution is to make the parser into a self-modifying program. See also Is C++ context-free or context-sensitive?
Often, operator precedence and associativity are not expressed directly in CFGs even though it is possible. For example, a CFG for a small expression grammar where ^ binds tighter than ×, and × binds tighter than +, might look like this:
This CFG is ambiguous, however, and is often accompanied by a precedence / associativity table saying e.g. that ^ binds tightest, × binds tighter than +, that ^ is right-associative, and that × and + are left-associative.
Precedence and associativity can be encoded into a CFG in a mechanical way such that it is unambiguous and only produces syntax trees where the operators behave correctly. An example of this for the grammar above:
But ambiguous CFGs + precedence / associativity tables are common because they're more readable and because various types of LR parser generator libraries can produce more efficient parsers by eliminating shift/reduce conflicts instead of dealing with an unambiguous, transformed grammar of a larger size.
In theory, all finite sets of strings are regular languages, and so all legal programs of bounded size are regular. Since regular languages are a subset of context-free languages, all programs of bounded size are context-free. The argument continues,
The same goes for CFGs. To address your sub-question a little differently,
Most real-world programming languages are defined by their implementations, and most parsers for real-world programming languages are either hand-written or uses a parser generator that extends context-free parsing. It is unfortunately not that common to find an exact CFG for your favourite language. When you do, it's usually in Backus-Naur form (BNF), or a parser specification that most likely isn't purely context-free.
Examples of grammar specifications from the wild:
对于几乎所有语言来说,语法正确的程序集都是上下文无关的。
对于几乎所有语言来说,编译的程序集都不是上下文无关的。 例如,如果所有编译 C 程序的集合是上下文无关的,那么通过与正则语言(也称为正则表达式)相交,所有匹配的编译 C 程序的集合
将是上下文无关的,但这显然是同构的到语言 a^kba^kba^k,众所周知,它不是上下文无关的。
The set of programs that are syntactically correct is context-free for almost all languages.
The set of programs that compile is not context-free for almost all languages. For example, if the set of all compiling C programs were context free, then by intersecting with a regular language (also known as a regex), the set of all compiling C programs that match
would be context-free, but this is clearly isomorphic to the language a^kba^kba^k, which is well-known not to be context-free.
根据您对问题的理解方式,答案会发生变化。 但恕我直言,正确的答案是所有现代编程语言实际上都是上下文相关的。 例如,不存在仅接受语法正确的 C 程序的上下文无关语法。 那些指出 C 的 yacc/bison 上下文无关语法的人没有抓住重点。
Depending on how you understand the question, the answer changes. But IMNSHO, the proper answer is that all modern programming languages are in fact context sensitive. For example there is no context free grammar that accepts only syntactically correct C programs. People who point to yacc/bison context free grammars for C are missing the point.
举一个非上下文无关语法的最戏剧性的例子,据我所知,Perl 的语法是 图灵-完成。
To go for the most dramatic example of a non-context-free grammar, Perl's grammar is, as I understand it, turing-complete.
如果我理解你的问题,那么你正在寻找可以通过上下文无关语法(cfg)描述的编程语言,以便 cfg 生成所有有效的程序并且仅生成有效的程序。
因此,我相信大多数(如果不是全部)现代编程语言都不是上下文无关的。 例如,一旦您拥有用户定义的类型(在现代语言中非常常见),您就会自动对上下文敏感。
验证程序的语法和验证语义正确性之间存在差异。 检查语法与上下文无关,而检查语义正确性则不然(同样,在大多数语言中)。
然而,这并不意味着这样的语言不存在。 例如,非类型化 lambda 演算 可以使用上下文无关语法来描述,并且当然是,图灵完备。
If I understand your question, you are looking for programming languages which can be described by context free grammars (cfg) so that the cfg generates all valid programs and only valid programs.
I believe that most (if not all) modern programming languages are therefore not context free. For example, once you have user defined types (very common in modern languages) you are automatically context sensitive.
There is a difference between verifying syntax and verifying semantic correctness of a program. Checking syntax is context free, whereas checking semantic correctness isn't (again, in most languages).
This, however, does not mean that such a language cannot exist. Untyped lambda calculus, for example, can be described using a context free grammar, and is, of course, Turing complete.
大多数现代编程语言都不是上下文无关语言。 作为证明,如果我深入研究 CFL 的根源,其相应的机器 PDA 无法处理像
{ww | } 这样的字符串匹配。 w 是一个字符串}
。 所以大多数编程语言都需要这样做。例子:
Most of the modern programming languages are not context-free languages. As a proof, if I delve into the root of CFL its corresponding machine PDA can't process string matchings like
{ww | w is a string}
. So most programming languages require that.Example:
VHDL 在某种程度上是上下文相关的:
(Eli Bendersky,“解析 VHDL [非常] 困难”,2009)
VHDL is somewhat context sensitive:
(Eli Bendersky, “Parsing VHDL is [very] hard”, 2009)
让我们以 Swift 为例,用户可以定义运算符,包括运算符优先级和关联性。 例如,运算符 + 和 * 实际上是在标准库中定义的。
上下文无关语法和词法分析器可能能够解析 a + b - c * d + e,但语义是“五个操作数 a、b、c、d 和 e,由运算符 +、-、* 和 + 分隔”。 这就是解析器在不了解运算符的情况下可以实现的目标。 上下文无关语法和词法分析器还可以解析 a +-+ b -+- c,它是由运算符 +-+ 和 -+- 分隔的三个操作数 a、b 和 c。
解析器可以根据上下文无关的 Swift 语法“解析”源文件,但这还远远没有完成。 另一个步骤是收集有关运算符的知识,然后将 a + b - c * d + e 的语义更改为与 operator+ (operator- (operator+ (a, b), operator* (c, d)) 相同, e).
因此,存在(或者可能存在,我没有仔细检查)上下文无关语法,但它只能让您到目前为止解析程序。
Let's take Swift, where the user can define operators including operator precedence and associativity. For example, the operators + and * are actually defined in the standard library.
A context free grammar and a lexer may be able to parse a + b - c * d + e, but the semantics is "five operands a, b, c, d and e, separated by the operators +, -, * and +". That's what a parser can achieve without knowing about operators. A context free grammar and a lexer may also be able to parse a +-+ b -+- c, which is three operands a, b and c separated by operators +-+ and -+-.
A parser can "parse" a source file according to a context-free Swift grammar, but that's nowhere near the job done. Another step would be collecting knowledge about operators, and then change the semantics of a + b - c * d + e to be the same as operator+ (operator- (operator+ (a, b), operator* (c, d)), e).
So there is (or maybe there is, I havent checked to closely) a context free grammar, but it only gets you so far to parsing a program.
我认为 Haskell 和 ML 支持上下文无关。 请参阅 Haskell 的链接。
I think Haskell and ML are supporting context free. See this link for Haskell.