当前位置：文江博客话题详情

哪些编程语言是上下文无关的？

发布于 2024-07-22 06:14:14 字数 142 浏览 13 评论 0原文

或者，更准确地说：哪些编程语言是由上下文无关语法定义的？

据我所知，由于宏和模板之类的原因，C++ 并不是上下文无关的。我的直觉告诉我，函数式语言可能是上下文无关的，但我没有任何硬数据来支持这一点。

简洁示例的额外代表:-)

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

风尘浪孓 2024-07-29 06:14:14

哪些编程语言是上下文无关的？ [...]

我的直觉告诉我函数式语言可能是上下文无关的[...]

简短版本：现实世界中几乎没有任何编程语言是上下文无关的这个词的任何含义。一种语言是否与上下文无关与其功能性无关。这只是语法复杂程度的问题。

这是命令式语言 Brainfuck 的 CFG：

Program → Instr Program | ε
Instr → '+' | '-' | '>' | '<' | ',' | '.' | '[' Program ']'

这是函数 SKI 组合器演算：

Program → E
E → 'S' E E E
E → 'K' E E
E → 'I'
E → '(' E ')'

这些 CFG 识别所有< /em> 两种语言的有效程序，因为它们非常简单。

较长的版本：通常，上下文无关语法（CFG）仅用于粗略指定语言的语法。人们必须区分语法正确的程序和正确编译/评估的程序。最常见的是，编译器将语言分析分为构建和验证一段代码的一般结构的语法分析和验证含义的语义分析> 的程序。

如果“上下文无关语言”的意思是“...所有程序都为其编译”，那么答案是：几乎没有。符合这一要求的语言几乎没有任何规则或复杂的功能，例如变量的存在、空白敏感性、类型系统或任何其他上下文：在一个地方定义并在另一个地方依赖的信息。

另一方面，如果“上下文无关语言”仅意味着“...所有程序都通过语法分析”，那么答案就在于语法本身有多复杂。有许多句法特征很难或不可能仅用 CFG 来描述。其中一些问题可以通过向解析器添加额外的状态来跟踪计数器、查找表等来克服。

无法使用 CFG 表达的语法特征示例：

对缩进和空格敏感的语言，例如 Python 和 Haskell。跟踪任意嵌套的缩进级别本质上是上下文相关的，并且需要单独的缩进级别计数器；每个级别使用多少空间以及有多少级别。
通过为每个缩进级别重复语法，仅允许使用固定数量的空格进行固定级别的缩进，但实际上这很不方便。
C Typedef 解析问题说 C程序在词法分析期间是不明确的，因为它无法仅从语法中知道某物是否是现有类型的常规标识符或 typedef 别名。
示例是：
<块引用>
```
 typedef int my_int; 
    my_int x; 
  
```
<块引用>
在分号处，需要使用 my_int 条目更新类型环境。但是，如果词法分析器已经预先查找 my_int，它会将其词法分析为标识符而不是类型名称。
在上下文无关语法术语中，在 my_int 上触发的 X → ... 规则是不明确的：它可以是生成标识符的规则，也可以是生成标识符的规则。一个产生 typedef'ed 类型的类型；知道哪一个依赖于语法本身之外的查找表（上下文）。
基于宏和模板的语言，例如 Lisp、C++、Template Haskell、Nim 等。由于语法在解析时会发生变化，因此一种解决方案是将解析器变成一个自修改程序。另请参阅 C++ 是上下文无关的还是上下文敏感的？
通常，运算符优先级和结合性不会直接在 CFG 中表达，尽管它是可能的。例如，小表达式语法的 CFG（其中 ^ 比 × 绑定更紧密，× 比 + 绑定更紧密）可能如下所示：
<前><代码> E → E ^ E
E→E×E
E → E + E
E → (E)
E → 数字
但是，此 CFG 不明确，并且通常伴随有优先级/结合性表表示 ^ 结合最紧密，× 比 + 结合更紧密，^ 是右结合，并且 × 和 + 是左关联的。
优先级和关联性可以以机械方式编码到 CFG 中，这样它就不会产生歧义，并且仅在运算符行为正确的情况下生成语法树。上述语法的示例：
<前><代码> E₀ → EA E₁
EA → E₁ + EA
EA → ε
E₁ → EM E2
EM → E2 × EM
EM → ε
E2 → E₃ EP
EP → ^ E₃ EP
E₃ → 数字
E₃ → (E₀)
但是不明确的 CFG + 优先级/关联性表很常见，因为它们更具可读性，并且因为各种类型的 LR解析器生成器库可以通过消除移位/归约冲突，而不是处理更大尺寸的明确的、转换的语法。

理论上，所有有限的字符串集都是正则语言，因此所有有界大小的合法程序都是正则的。由于常规语言是上下文无关语言的子集，因此所有有限大小的程序都是上下文无关的。争论还在继续，

虽然可以说，对于一种语言来说，只允许少于一百万行的程序是可以接受的限制，但将编程语言描述为常规语言是不切实际的：描述太大了.
— Torben Morgensen 的编译器设计基础，第 1 章 2.10.2

CFG 也是如此。为了以不同的方式解决你的子问题，

哪些编程语言是由上下文无关语法定义的？

大多数现实世界的编程语言都是由其实现来定义的，并且大多数现实世界编程语言的解析器要么是手写的，要么使用扩展上下文无关解析的解析器生成器。遗憾的是，为您喜欢的语言找到准确的 CFG 并不常见。当您这样做时，通常采用 Backus-Naur 形式 (BNF)，或者很可能不是纯粹上下文无关的解析器规范。

来自野外的语法规范示例：

标准 ML 的 BNF
< a href="https://www.haskell.org/onlinereport/syntax-iso.html" rel="noreferrer">Haskell 类似 BNF
SQL 的 BNF
PHP 的 Yacc 语法

What programming languages are context-free? [...]

My gut tells me that functional languages might be context-free [...]

The short version: There are hardly any real-world programming languages that are context-free in any meaning of the word. Whether a language is context-free or not has nothing to do with it being functional. It is simply a matter of how complex the syntax is.

Here's a CFG for the imperative language Brainfuck:

Program → Instr Program | ε
Instr → '+' | '-' | '>' | '<' | ',' | '.' | '[' Program ']'

And here's a CFG for the functional SKI combinator calculus:

Program → E
E → 'S' E E E
E → 'K' E E
E → 'I'
E → '(' E ')'

These CFGs recognize all valid programs of the two languages because they're so simple.

The longer version: Usually, context-free grammars (CFGs) are only used to roughly specify the syntax of a language. One must distinguish between syntactically correct programs and programs that compile/evaluate correctly. Most commonly, compilers split language analysis into syntax analysis that builds and verifies the general structure of a piece of code, and semantic analysis that verifies the meaning of the program.

If by "context-free language" you mean "... for which all programs compile", then the answer is: hardly any. Languages that fit this bill hardly have any rules or complicated features, like the existence of variables, whitespace-sensitivity, a type system, or any other context: Information defined in one place and relied upon in another.

If, on the other hand, "context-free language" only means "... for which all programs pass syntax analysis", the answer is a matter of how complex the syntax alone is. There are many syntactic features that are hard or impossible to describe with a CFG alone. Some of these are overcome by adding additional state to parsers for keeping track of counters, lookup tables, and so on.

Examples of syntactic features that are not possible to express with a CFG:

Indentation- and whitespace-sensitive languages like Python and Haskell. Keeping track of arbitrarily nested indentation levels is essentially context-sensitive and requires separate counters for the indentation level; both how many spaces that are used for each level and how many levels there are.
Allowing only a fixed level of indentation using a fixed amount of spaces would work by duplicating the grammar for each level of indentation, but in practice this is inconvenient.
The C Typedef Parsing Problem says that C programs are ambiguous during lexical analysis because it cannot know from the grammar alone if something is a regular identifier or a typedef alias for an existing type.
The example is:
```
  typedef int my_int;
  my_int x;
```
At the semicolon, the type environment needs to be updated with an entry for my_int. But if the lexer has already looked ahead to my_int, it will have lexed it as an identifier rather than a type name.
In context-free grammar terms, the X → ... rule that would trigger on my_int is ambiguous: It could be either one that produces an identifier, or one that produces a typedef'ed type; knowing which one relies on a lookup table (context) beyond the grammar itself.
Macro- and template-based languages like Lisp, C++, Template Haskell, Nim, and so on. Since the syntax changes as it is being parsed, one solution is to make the parser into a self-modifying program. See also Is C++ context-free or context-sensitive?
Often, operator precedence and associativity are not expressed directly in CFGs even though it is possible. For example, a CFG for a small expression grammar where ^ binds tighter than ×, and × binds tighter than +, might look like this:
```
  E → E ^ E
  E → E × E
  E → E + E
  E → (E)
  E → num
```
This CFG is ambiguous, however, and is often accompanied by a precedence / associativity table saying e.g. that ^ binds tightest, × binds tighter than +, that ^ is right-associative, and that × and + are left-associative.
Precedence and associativity can be encoded into a CFG in a mechanical way such that it is unambiguous and only produces syntax trees where the operators behave correctly. An example of this for the grammar above:
```
  E₀ → EA E₁
  EA → E₁ + EA
  EA → ε
  E₁ → EM E₂
  EM → E₂ × EM
  EM → ε
  E₂ → E₃ EP
  EP → ^ E₃ EP
  E₃ → num
  E₃ → (E₀)
```
But ambiguous CFGs + precedence / associativity tables are common because they're more readable and because various types of LR parser generator libraries can produce more efficient parsers by eliminating shift/reduce conflicts instead of dealing with an unambiguous, transformed grammar of a larger size.

In theory, all finite sets of strings are regular languages, and so all legal programs of bounded size are regular. Since regular languages are a subset of context-free languages, all programs of bounded size are context-free. The argument continues,

While it can be argued that it would be an acceptable limitation for a language to allow only programs of less than a million lines, it is not practical to describe a programming language as a regular language: The description would be far too large.
— Torben Morgensen's Basics of Compiler Design, ch. 2.10.2

The same goes for CFGs. To address your sub-question a little differently,

Which programming languages are defined by a context-free grammar?

Most real-world programming languages are defined by their implementations, and most parsers for real-world programming languages are either hand-written or uses a parser generator that extends context-free parsing. It is unfortunately not that common to find an exact CFG for your favourite language. When you do, it's usually in Backus-Naur form (BNF), or a parser specification that most likely isn't purely context-free.

Examples of grammar specifications from the wild:

回复收藏 0 原文

已下线请稍等 2024-07-29 06:14:14

对于几乎所有语言来说，语法正确的程序集都是上下文无关的。

对于几乎所有语言来说，编译的程序集都不是上下文无关的。例如，如果所有编译 C 程序的集合是上下文无关的，那么通过与正则语言（也称为正则表达式）相交，所有匹配的编译 C 程序的集合

^int main\(void\) { int a+; a+ = a+; return 0; }$

将是上下文无关的，但这显然是同构的到语言 a^kba^kba^k，众所周知，它不是上下文无关的。

The set of programs that are syntactically correct is context-free for almost all languages.

The set of programs that compile is not context-free for almost all languages. For example, if the set of all compiling C programs were context free, then by intersecting with a regular language (also known as a regex), the set of all compiling C programs that match

^int main\(void\) { int a+; a+ = a+; return 0; }$

would be context-free, but this is clearly isomorphic to the language a^kba^kba^k, which is well-known not to be context-free.

回复收藏 0 原文

╰沐子 2024-07-29 06:14:14

根据您对问题的理解方式，答案会发生变化。但恕我直言，正确的答案是所有现代编程语言实际上都是上下文相关的。例如，不存在仅接受语法正确的 C 程序的上下文无关语法。那些指出 C 的 yacc/bison 上下文无关语法的人没有抓住重点。

回复收藏 0 原文

情泪▽动烟 2024-07-29 06:14:14

举一个非上下文无关语法的最戏剧性的例子，据我所知，Perl 的语法是图灵-完成。

回复收藏 0 原文

只是我以为 2024-07-29 06:14:14

如果我理解你的问题，那么你正在寻找可以通过上下文无关语法（cfg）描述的编程语言，以便 cfg 生成所有有效的程序并且仅生成有效的程序。

因此，我相信大多数（如果不是全部）现代编程语言都不是上下文无关的。例如，一旦您拥有用户定义的类型（在现代语言中非常常见），您就会自动对上下文敏感。

验证程序的语法和验证语义正确性之间存在差异。检查语法与上下文无关，而检查语义正确性则不然（同样，在大多数语言中）。

然而，这并不意味着这样的语言不存在。例如，非类型化 lambda 演算可以使用上下文无关语法来描述，并且当然是，图灵完备。

回复收藏 0 原文

你的心境我的脸 2024-07-29 06:14:14

大多数现代编程语言都不是上下文无关语言。作为证明，如果我深入研究 CFL 的根源，其相应的机器 PDA 无法处理像 {ww | } 这样的字符串匹配。 w 是一个字符串}。所以大多数编程语言都需要这样做。

例子：

int fa; // w
fa=1; // ww as parser treat it like this

Most of the modern programming languages are not context-free languages. As a proof, if I delve into the root of CFL its corresponding machine PDA can't process string matchings like {ww | w is a string}. So most programming languages require that.

Example:

int fa; // w
fa=1; // ww as parser treat it like this

回复收藏 0 原文

单挑你×的.吻 2024-07-29 06:14:14

VHDL 在某种程度上是上下文相关的：

VHDL 在某种程度上是上下文相关的。考虑这个声明在
流程：
jinx := foo(1); 
  
嗯，取决于进程范围内定义的对象（及其
封闭范围），这可以是：
函数调用
为数组建立索引
为无参数函数调用返回的数组建立索引
为了正确解析它，解析器必须携带分层符号表
（具有封闭范围），并且当前文件还不够。 foo 可以是
包中定义的函数。所以解析器应该首先分析包
由正在解析的文件导入，并找出其中定义的符号。
这只是一个例子。 VHDL 类型/子类型系统是类似的
上下文相关的混乱非常难以解析。

（Eli Bendersky，“解析 VHDL [非常] 困难”，2009）

VHDL is somewhat context sensitive:

VHDL is context-sensitive in a mean way. Consider this statement inside a
process:
jinx := foo(1);
Well, depending on the objects defined in the scope of the process (and its
enclosing scopes), this can be either:
A function call
Indexing an array
Indexing an array returned by a parameter-less function call
To parse this correctly, a parser has to carry a hierarchical symbol table
(with enclosing scopes), and the current file isn't even enough. foo can be a
function defined in a package. So the parser should first analyze the packages
imported by the file it's parsing, and figure out the symbols defined in them.
This is just an example. The VHDL type/subtype system is a similarly
context-sensitive mess that's very difficult to parse.

(Eli Bendersky, “Parsing VHDL is [very] hard”, 2009)

回复收藏 0 原文

鹿港小镇 2024-07-29 06:14:14

让我们以 Swift 为例，用户可以定义运算符，包括运算符优先级和关联性。例如，运算符 + 和 * 实际上是在标准库中定义的。

上下文无关语法和词法分析器可能能够解析 a + b - c * d + e，但语义是“五个操作数 a、b、c、d 和 e，由运算符 +、-、* 和 + 分隔”。这就是解析器在不了解运算符的情况下可以实现的目标。上下文无关语法和词法分析器还可以解析 a +-+ b -+- c，它是由运算符 +-+ 和 -+- 分隔的三个操作数 a、b 和 c。

解析器可以根据上下文无关的 Swift 语法“解析”源文件，但这还远远没有完成。另一个步骤是收集有关运算符的知识，然后将 a + b - c * d + e 的语义更改为与 operator+ (operator- (operator+ (a, b), operator* (c, d)) 相同， e).

因此，存在（或者可能存在，我没有仔细检查）上下文无关语法，但它只能让您到目前为止解析程序。

回复收藏 0 原文