当前位置：文江博客话题详情

Stream 或 Iterator 来生成与正则表达式匹配的所有字符串？

发布于 2024-12-24 20:13:05 字数 407 浏览 2 评论 0原文

这是我之前的问题的后续问题。

假设我想生成与给定（简化）正则表达式匹配的所有字符串。

这只是一个编码练习，我没有任何额外的要求（例如，实际上生成了多少个字符串）。所以主要的要求是生成漂亮、干净、简单的代码。

我考虑过使用 Stream 但在阅读这个问题我正在考虑迭代器。你会用什么？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

远山浅 2024-12-31 20:13:05

该问题的解决方案需要太多代码，因此无法在此处实际回答，但大纲如下。

首先，您想要解析正则表达式 - 例如，您可以为此研究解析器组合器。然后，您将拥有一个评估树，例如，

List(
  Constant("abc"),
  ZeroOrOne(Constant("d")),
  Constant("efg"),
  OneOf(Constant("h"),List(Constant("ij"),ZeroOrOne(Constant("klmnop")))),
  Constant("qrs"),
  AnyChar()
)

您可以通过定义生成器将其作为生成器运行，而不是将此表达式树作为匹配器运行每个术语的方法。对于某些术语（例如ZeroOrOne(Constant("d"))），会有多个选项，因此您可以定义一个迭代器。实现此目的的一种方法是在每个术语中存储内部状态并传入“提前”标志或“重置”标志。在“重置”时，生成器返回第一个可能的匹配项（例如 ""）；提前，它会转到下一个并返回该值（例如 "d"），同时消耗提前标志（留下其余的在没有标志的情况下进行评估）。如果没有更多的项目，它将对自身内部的所有内容进行重置，并为下一个项目保留完整的提前标志。你首先要进行重置运行；在每次迭代中，您都会提前输入，并在再次取出时停止。

当然，一些正则表达式结构（例如 "d+" ）可以生成无限多个值，因此您可能希望以某种方式限制它们（或者在某些时候返回，例如 d...d 意思是“很多”）；其他的有很多可能的值（例如 . 匹配任何字符，但您真的想要所有 64k 字符，还是有很多 unicode 代码点？），您可能也希望限制这些值。

不管怎样，这虽然很耗时，但会产生一个可以工作的发电机。而且，顺便说一句，如果您为解析树的每个部分编写匹配例程，您还将拥有一个可用的正则表达式匹配器。

The solution to this question asks for too much code for it to be practical to answer here, but the outline goes as follows.

First, you want to parse your regular expression--you can look into parser combinators for this, for example. You'll then have an evaluation tree that looks like, for example,

List(
  Constant("abc"),
  ZeroOrOne(Constant("d")),
  Constant("efg"),
  OneOf(Constant("h"),List(Constant("ij"),ZeroOrOne(Constant("klmnop")))),
  Constant("qrs"),
  AnyChar()
)

Rather than running this expression tree as a matcher, you can run it as a generator by defining a generate method on each term. For some terms, (e.g. ZeroOrOne(Constant("d"))), there will be multiple options, so you can define an iterator. One way to do this is to store internal state in each term and pass in either an "advance" flag or a "reset" flag. On "reset", the generator returns the first possible match (e.g. ""); on advance, it goes to the next one and returns that (e.g. "d") while consuming the advance flag (leaving the rest to evaluate with no flags). If there are no more items, it produces a reset instead for everything inside itself and leaves the advance flag intact for the next item. You start by running with a reset; on each iteration, you put an advance in, and stop when you get it out again.

Of course, some regex constructs like "d+" can produce infinitely many values, so you'll probably want to limit them in some way (or at some point return e.g. d...d meaning "lots"); and others have very many possible values (e.g. . matches any char, but do you really want all 64k chars, or howevermany unicode code points there are?), and you may wish to restrict those also.

Anyway, this, though time-consuming, will result in a working generator. And, as an aside, you'll also have a working regex matcher, if you write a match routine for each piece of the parsed tree.

回复收藏 0 原文

~没有更多了~