当前位置：文江博客话题详情

什么是常规语言？

发布于 2024-11-24 13:40:55 字数 486 浏览 3 评论 0原文

我试图理解语言级别的概念（常规、上下文无关、上下文敏感等）。

我可以轻松查找此内容，但我找到的所有解释都是一堆符号并谈论集合。我有两个问题：

你能用文字描述什么是常规语言，以及这些语言有何不同吗？
人们从哪里学习理解这些东西？据我了解，这是形式数学？我在大学有几门课程使用了它，但几乎没有人理解它，因为导师只是假设我们知道它。我可以在哪里学习它？为什么人们“期望”通过这么多来源了解它？就好像教育方面存在差距。

这是一个示例：

属于该集合的任何语言都是字母表上的常规语言。

语言怎么能“超越”任何事物呢？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

纸伞微斜 2024-12-01 13:40:55

在计算机科学的背景下，单词是符号的串联。使用的符号称为字母。例如，一些由字母表 {0,1,2,3,4,5,6,7,8,9} 组成的单词将是 1, <代码>2、12、543、1000和002。

那么语言就是所有可能单词的子集。例如，我们可能想要定义一种语言来捕获军情六处的所有精英特工。这些都以双 0 开头，因此该语言中的单词将是 007、001、005 和 0012 >，但不是 07 或 15。为了简单起见，我们说语言是“在字母表上”，而不是“由字母表中符号串联而成的单词子集”。

在计算机科学中，我们现在想要对语言进行分类。如果一种语言可以通过逐个检查单词中的所有符号来确定一个单词是否属于具有恒定（有限）内存的算法/机器的语言，那么我们将这种语言称为“常规”语言。仅由单词 42 组成的语言是常规的，因为您可以确定单词是否在其中，而不需要任意数量的内存；你只需检查第一个符号是否为 4，第二个符号是否为 2，以及后面是否还有其他数字。

所有具有有限单词数的语言都是常规的，因为我们（理论上）可以构建一个恒定大小的控制流树（您可以将其想象为一堆嵌套的 if 语句，这些语句检查一个另一个之后的数字）。例如，我们可以使用以下构造来测试一个单词是否属于“10 到 99 之间的素数”语言，除了在我们当前所在的代码行进行编码的内存之外，不需要任何内存：

if word[0] == 1:
  if word[1] == 1: # 11
      return true # "accept" word, i.e. it's in the language
  if word[1] == 3: # 13
      return true
...
return false

请注意，所有有限语言都是正则语言，但并非所有正则语言都是有限的；我们的双 0 语言包含无限数量的单词（007、008，还有 004242 和 0012345），但可以用常量记忆来测试：要测试一个单词是否属于其中，请检查第一个符号是否为0，以及第二个符号是否为0。如果是这样的话，就接受吧。如果该单词短于三个或不以 00 开头，则它不是 MI6 代号。

形式上，有限状态机或正则语法用于证明语言是正则的。这些类似于上面的 if 语句，但允许任意长的单词。如果存在有限状态机，则也存在正则语法，反之亦然，因此显示其中任何一个就足够了。例如，我们的双 0 语言的有限状态机是：

start state:  if input = 0 then goto state 2
start state:  if input = 1 then fail
start state:  if input = 2 then fail
...
state 2: if input = 0 then accept
state 2: if input != 0 then fail
accept: for any input, accept

等效的正则语法是：

start → 0 B
B → 0 accept
accept → 0 accept
accept → 1 accept
...

等效的正则表达式是：

00[0-9]*

有些语言不是正则的。例如，任意数量的1的语言，后面跟着相同数量的2（通常写为1ⁿ2ⁿ，对于任意n）是不规则的 - 您需要超过恒定量的内存（=恒定数量的状态）来存储1 来决定一个单词是否在该语言中。

这通常应该在理论计算机科学课程中进行解释。幸运的是，维基百科解释了正式和常规语言非常好。

In the context of computer science, a word is the concatenation of symbols. The used symbols are called the alphabet. For example, some words formed out of the alphabet {0,1,2,3,4,5,6,7,8,9} would be 1, 2, 12, 543, 1000, and 002.

A language is then a subset of all possible words. For example, we might want to define a language that captures all elite MI6 agents. Those all start with double-0, so words in the language would be 007, 001, 005, and 0012, but not 07 or 15. For simplicity's sake, we say a language is "over an alphabet" instead of "a subset of words formed by concatenation of symbols in an alphabet".

In computer science, we now want to classify languages. We call a language regular if it can be decided if a word is in the language with an algorithm/a machine with constant (finite) memory by examining all symbols in the word one after another. The language consisting just of the word 42 is regular, as you can decide whether a word is in it without requiring arbitrary amounts of memory; you just check whether the first symbol is 4, whether the second is 2, and whether any more numbers follow.

All languages with a finite number of words are regular, because we can (in theory) just build a control flow tree of constant size (you can visualize it as a bunch of nested if-statements that examine one digit after the other). For example, we can test whether a word is in the "prime numbers between 10 and 99" language with the following construct, requiring no memory except the one to encode at which code line we're currently at:

if word[0] == 1:
  if word[1] == 1: # 11
      return true # "accept" word, i.e. it's in the language
  if word[1] == 3: # 13
      return true
...
return false

Note that all finite languages are regular, but not all regular languages are finite; our double-0 language contains an infinite number of words (007, 008, but also 004242 and 0012345), but can be tested with constant memory: To test whether a word belongs in it, check whether the first symbol is 0, and whether the second symbol is 0. If that's the case, accept it. If the word is shorter than three or does not start with 00, it's not an MI6 code name.

Formally, the construct of a finite-state machine or a regular grammar is used to prove that a language is regular. These are similar to the if-statements above, but allow for arbitrarily long words. If there's a finite-state machine, there is also a regular grammar, and vice versa, so it's sufficient to show either. For example, the finite state machine for our double-0 language is:

start state:  if input = 0 then goto state 2
start state:  if input = 1 then fail
start state:  if input = 2 then fail
...
state 2: if input = 0 then accept
state 2: if input != 0 then fail
accept: for any input, accept

The equivalent regular grammar is:

start → 0 B
B → 0 accept
accept → 0 accept
accept → 1 accept
...

The equivalent regular expression is:

00[0-9]*

Some languages are not regular. For example, the language of any number of 1, followed by the same number of 2 (often written as 1ⁿ2ⁿ, for an arbitrary n) is not regular - you need more than a constant amount of memory (= a constant number of states) to store the number of 1s to decide whether a word is in the language.

This should usually be explained in the theoretical computer science course. Luckily, Wikipedia explains both formal and regular languages quite nicely.

回复收藏 0 原文

涫野音 2024-12-01 13:40:55

以下是维基百科中的一些等效定义：

[...] 常规语言是一种形式语言（即，可能是
有限字母表中符号的有限序列的无限集合）
满足以下等效属性：
它可以被确定性有限状态机接受。
它可以被非确定性有限状态机接受
它可以用正式的正则表达式来描述。
请注意，许多编程语言都提供了“正则表达式”功能
增强了使它们能够识别的功能
不规则的语言，因此不严格
相当于正式的正则表达式。

首先要注意的是，常规语言是正式语言，有一些限制。形式语言本质上是字符串的（可能是无限的）集合。例如，形式语言Java是所有可能的Java文件的集合，它是所有可能的文本文件集合的子集。

最重要的特征之一是，与上下文无关语言不同，常规语言不支持任意嵌套/递归，但可以任意重复。

语言总是有一个底层字母表，它是允许的符号集。例如，编程语言的字母表通常是 ASCII 或 Unicode，但在形式语言理论中，谈论语言而不是其他字母表也可以，例如二进制字母表，其中唯一允许的字符是 0和<代码>1。

在我的大学，我们在编译器课上教授了一些形式语言理论，但这可能在不同学校之间有所不同。