计算机有可能“学习”吗? 用户提供的示例的正则表达式?

发布于 2024-07-14 18:47:30 字数 320 浏览 8 评论 0原文

计算机是否有可能通过用户提供的示例“学习”正则表达式?

澄清一下:

  • 不想想学习正则表达式。
  • 我想创建一个程序,它可以从用户交互提供的示例中“学习”正则表达式,也许可以通过从文本中选择部分或选择开始或结束标记来实现。

是否可以? 是否有可以通过 Google 搜索到的算法、关键字等?

编辑:谢谢您的回答,但我对提供此功能的工具不感兴趣。 我正在寻找理论信息,例如论文、教程、源代码、算法名称,这样我就可以为自己创建一些东西。

Is it possible for a computer to "learn" a regular expression by user-provided examples?

To clarify:

  • I do not want to learn regular expressions.
  • I want to create a program which "learns" a regular expression from examples which are interactively provided by a user, perhaps by selecting parts from a text or selecting begin or end markers.

Is it possible? Are there algorithms, keywords, etc. which I can Google for?

EDIT: Thank you for the answers, but I'm not interested in tools which provide this feature. I'm looking for theoretical information, like papers, tutorials, source code, names of algorithms, so I can create something for myself.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(10

遗忘曾经 2024-07-21 18:47:31

任何计算机程序都无法基于有效匹配列表生成有意义的正则表达式。 让我告诉你原因。

假设您提供示例 111111 和 999999,计算机应该生成: 与

  1. 这两个示例完全匹配的正则表达式: (111111|999999)
  2. 与 6 个相同数字匹配的正则表达式 (\d)\1{ 5}
  3. 匹配 6 个一和九的正则表达式 [19]{6}
  4. 匹配任意 6 个数字的正则表达式 \d{6}
  5. 以上三者中的任何一个,有字边界,例如 \b\d{6}\b
  6. 前三个中的任何一个,前后不带数字,例如
    (?

正如您所看到的,可以通过多种方式将示例概括为正则表达式。 计算机构建可预测正则表达式的唯一方法是要求您列出所有可能的匹配项。 然后它可以生成与这些匹配项完全匹配的搜索模式。

如果您不想列出所有可能的匹配项,则需要更高级别的描述。 这正是正则表达式旨在提供的功能。 您无需提供一长串 6 位数字,只需告诉程序匹配“任意六位数字”即可。 在正则表达式语法中,这变为 \d{6}。

任何提供与正则表达式一样灵活的高级描述的方法也将与正则表达式一样复杂。 像 RegexBuddy 这样的所有工具都可以使创建和测试高级描述变得更容易。 RegexBuddy 使您能够使用简单的英语构建块,而不是直接使用简洁的正则表达式语法。 但它无法为您创建高级描述,因为它无法神奇地知道何时应该概括您的示例,何时不应该概括。

当然可以创建一个使用示例文本以及用户提供的指导来生成正则表达式的工具。 设计这样一个工具的难点在于如何向用户询问所需的指导信息,而又不会使该工具比正则表达式本身更难学习,并且不会将该工具限制为常见的正则表达式作业或简单的正则表达式。

No computer program will ever be able to generate a meaningful regular expression based solely on a list of valid matches. Let me show you why.

Suppose you provide the examples 111111 and 999999, should the computer generate:

  1. A regex matching exactly those two examples: (111111|999999)
  2. A regex matching 6 identical digits (\d)\1{5}
  3. A regex matching 6 ones and nines [19]{6}
  4. A regex matching any 6 digits \d{6}
  5. Any of the above three, with word boundaries, e.g. \b\d{6}\b
  6. Any of the first three, not preceded or followed by a digit, e.g.
    (?<!\d)\d{6}(?!\d)

As you can see, there are many ways in which examples can be generalized into a regular expression. The only way for the computer to build a predictable regular expression is to require you to list all possible matches. Then it could generate a search pattern that matches exactly those matches.

If you don't want to list all possible matches, you need a higher-level description. That's exactly what regular expressions are designed to provide. Instead of providing a long list of 6-digit numbers, you simply tell the program to match "any six digits". In regular expression syntax, this becomes \d{6}.

Any method of providing a higher-level description that is as flexible as regular expressions will also be as complex as regular expressions. All tools like RegexBuddy can do is to make it easier to create and test the high-level description. Instead of using the terse regular expression syntax directly, RegexBuddy enables you to use plain English building blocks. But it can't create the high-level description for you, since it can't magically know when it should generalize your examples and when it should not.

It is certainly possible to create a tool that uses sample text along with guidelines provided by the user to generate a regular expression. The hard part in designing such a tool is how does it ask the user for the guiding information that it needs, without making the tool harder to learn than regular expressions themselves, and without restricting the tool to common regex jobs or to simple regular expressions.

爱,才寂寞 2024-07-21 18:47:31

是的,这当然是“可能的”; 这是伪代码:

string MakeRegexFromExamples(<listOfPosExamples>, <listOfNegExamples>)
{
   if HasIntersection(<listOfPosExamples>, <listOfNegExamples>)
     return <IntersectionError>

   string regex = "";
   foreach(string example in <listOfPosExamples>)
   {
      if(regex != "")
      {
         regex += "|";
      }
      regex += DoRegexEscaping(example);
   }
   regex = "^(" + regex + ")$";

   // Ignore <listOfNegExamples>; they're excluded by definition

   return regex;
}

问题是有无限数量的正则表达式可以匹配示例列表。 这段代码提供了该集中最简单/最愚蠢的正则表达式,基本上匹配正例列表中的任何内容(没有其他内容,包括任何负例)。

我认为真正的挑战是找到与所有示例相匹配的最短正则表达式,但即使如此,用户也必须提供非常好的输入以确保生成的表达式是“正确的”。

Yes, it's certainly "possible"; Here's the pseudo-code:

string MakeRegexFromExamples(<listOfPosExamples>, <listOfNegExamples>)
{
   if HasIntersection(<listOfPosExamples>, <listOfNegExamples>)
     return <IntersectionError>

   string regex = "";
   foreach(string example in <listOfPosExamples>)
   {
      if(regex != "")
      {
         regex += "|";
      }
      regex += DoRegexEscaping(example);
   }
   regex = "^(" + regex + ")$";

   // Ignore <listOfNegExamples>; they're excluded by definition

   return regex;
}

The problem is that there are an infinite number of regexs that will match a list of examples. This code provides the simplest/stupidest regex in the set, basically matching anything in the list of positive examples (and nothing else, including any of the negative examples).

I suppose the real challenge would be to find the shortest regex that matches all of the examples, but even then, the user would have to provide very good inputs to make sure the resulting expression was "the right one".

一个人的夜不怕黑 2024-07-21 18:47:31

我相信这个词是“归纳”。 你想归纳出常规语法。

我认为用有限的例子(正面或负面)来实现这一点是不可能的。 但是,如果我没记错的话,如果有可以查阅的Oracle就可以做到。 (基本上,您必须让程序询问用户是/否问题,直到用户满意为止。)

I believe the term is "induction". You want to induce a regular grammar.

I don't think it is possible with a finite set of examples (positive or negative). But, if I recall correctly, it can be done if there is an Oracle which can be consulted. (Basically you'd have to let the program ask the user yes/no questions until it was content.)

半寸时光 2024-07-21 18:47:31

您可能想玩一下这个网站,它非常酷,听起来它的功能与您正在谈论的类似:http: //txt2re.com

You might want to play with this site a bit, it's quite cool and sounds like it does something similar to what you're talking about: http://txt2re.com

我早已燃尽 2024-07-21 18:47:31

有一种基于 prolog 的语言专门用于解决此类问题。 它称为 progol

正如其他人提到的,基本思想是归纳学习,通常称为 ILP(归纳逻辑编程)在人工智能圈子里。

第二个链接是关于 ILP 的 wiki 文章,如果您有兴趣了解有关该主题的更多信息,其中包含许多有用的源材料。

There's a language dedicated to problems like this, based on prolog. It's called progol.

As others have mentioned, the basic idea is inductive learning, often called ILP (inductive logic programming) in AI circles.

Second link is the wiki article on ILP, which contains a lot of useful source material if you're interested in learning more about the topic.

℉絮湮 2024-07-21 18:47:31

@尤瓦尔是正确的。 您正在研究计算学习理论,或“归纳推理”。

这个问题比您想象的更复杂,因为“学习”的定义并不简单。 一个常见的定义是,学习者可以随时吐出答案,但最终,它必须要么停止吐出答案,要么总是吐出相同的答案。 这假设有无限数量的输入,并且绝对不能保证程序何时做出决定。 此外,您无法判断它何时做出决定,因为它稍后可能仍会输出不同的内容。

根据这个定义,我非常确定常规语言是可以学习的。 根据其他定义,没有那么多......

@Yuval is correct. You're looking at computational learning theory, or "inductive inference. "

The question is more complicated than you think, as the definition of "learn" is non-trivial. One common definition is that the learner can spit out answers whenever it wants, but eventually, it must either stop spitting out answers, or always spit out the same answer. This assumes an infinite number of inputs, and gives absolutely no garauntee on when the program will reach its decision. Also, you can't tell when it HAS reached its decision because it might still output something different later.

By this definition, I'm pretty sure that regular languages are learnable. By other definitions, not so much...

久隐师 2024-07-21 18:47:31

我在 Google 和 CiteSeer 上做了一些研究,并发现了这些技术/论文:

另外,Dana Angluin 的“从查询和反例中学习正则集”似乎很有前途,但我找不到 PS 或 PDF 版本,只有引用和研讨会论文。

看来,即使在理论层面上,这也是一个棘手的问题。

I've done some research on Google and CiteSeer and found these techniques/papers:

Also Dana Angluin's "Learning regular sets from queries and counterexamples" seems promising, but I wasn't able to find a PS or PDF version, only cites and seminar papers.

It seems that this is a tricky problem even on the theoretical level.

暖心男生 2024-07-21 18:47:31

如果一个人有可能学习正则表达式,那么程序从根本上来说也是可能的。 然而,该程序需要正确编程才能学习。 幸运的是,这是一个相当有限的逻辑空间,因此它不会像教一个程序能够看到物体或类似的东西那么复杂。

If its possible for a person to learn a regular expression, then it is fundamentally possible for a program. However, that program will need to be correctly programmed to be able to learn. Luckily this is a fairly finite space of logic, so it wouldn't be as complex as teaching a program to be able to see objects or something like that.

人生戏 2024-07-21 18:47:30

是的,
有可能的,
我们可以从示例中生成正则表达式(文本 -> 所需的提取)。
这是一个可以完成这项工作的在线工具:http://regex.inginf.units.it/

Regex Generator++ 在线工具使用 GP 搜索算法从提供的示例生成正则表达式。
GP 算法由多目标适应度驱动,可带来更高的性能和更简单的解决方案结构(奥卡姆剃刀)。
该工具是的里雅斯特大学(Università degli studi di Trieste)机器学习实验室的演示应用程序。
请在此处观看视频教程。

这是一个研究项目,因此您可以在此处了解所使用的算法。

看! :-)

从示例中找到有意义的正则表达式/解决方案是可能的当且仅当提供的示例很好地描述了问题。
考虑这些描述提取任务的示例,我们正在寻找特定的项目代码; 示例是文本/提取对:

"The product code is 467-345A" -> "467-345A"
"The item 789-345B is broken"  -> "789-345B"

一个(人类)人在查看示例时可能会说:“项目代码类似于 \d++-345[AB]”

当项目代码更宽松但我们没有提供其他内容时例如,我们没有证据来很好地理解这个问题。
当将人工生成的解决方案 \d++-345[AB] 应用到以下文本时,它会失败:

"On the back of the item there is a code: 966-347Z"

您必须提供其他示例,以便更好地描述什么是匹配,什么不是所需的匹配:
--ie:

"My phone is +39-128-3905 , and the phone product id is 966-347Z" -> "966-347Z"

电话号码不是产品id,这可能是一个重要的证明。

Yes,
it is possible,
we can generate regexes from examples (text -> desired extractions).
This is a working online tool which does the job: http://regex.inginf.units.it/

Regex Generator++ online tool generates a regex from provided examples using a GP search algorithm.
The GP algorithm is driven by a multiobjective fitness which leads to higher performance and simpler solution structure (Occam's Razor).
This tool is a demostrative application by the Machine Lerning Lab, Trieste Univeristy (Università degli studi di Trieste).
Please look at the video tutorial here.

This is a research project so you can read about used algorithms here.

Behold! :-)

Finding a meaningful regex/solution from examples is possible if and only if the provided examples describe the problem well.
Consider these examples that describe an extraction task, we are looking for particular item codes; the examples are text/extraction pairs:

"The product code is 467-345A" -> "467-345A"
"The item 789-345B is broken"  -> "789-345B"

An (human) guy, looking at the examples, may say: "the item codes are things like \d++-345[AB]"

When the item code is more permissive but we have not provided other examples, we have not proofs to understand the problem well.
When applying the human generated solution \d++-345[AB] to the following text, it fails:

"On the back of the item there is a code: 966-347Z"

You have to provide other examples, in order to better describe what is a match and what is not a desired match:
--i.e:

"My phone is +39-128-3905 , and the phone product id is 966-347Z" -> "966-347Z"

The phone number is not a product id, this may be an important proof.

落叶缤纷 2024-07-21 18:47:30

计算学习理论简介一书包含一种用于学习有限矩阵的算法自动机。 由于每种正则语言都相当于一个有限自动机,因此可以通过程序学习一些正则表达式。 卡恩斯和 Valiant展示了一些无法学习有限自动机的情况。 一个相关的问题是学习隐马尔可夫模型,它们是概率自动机,可以描述字符序列。 请注意,编程语言中使用的大多数现代“正则表达式”实际上比正则语言更强大,因此有时更难学习。

The book An Introduction to Computational Learning Theory contains an algorithm for learning a finite automaton. As every regular language is equivalent to a finite automaton, it is possible to learn some regular expressions by a program. Kearns and Valiant show some cases where it is not possible to learn a finite automaton. A related problem is learning hidden Markov Models, which are probabilistic automata that can describe a character sequence. Note that most modern "regular expressions" used in programming languages are actually stronger than regular languages, and therefore sometimes harder to learn.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文