当前位置：文江博客话题详情

提取字符串表达式表示的所有可能的字符串集（汤普森构造的一部分）

发布于 2025-01-11 07:07:36 字数 1547 浏览 0 评论 0原文

（这是我在这里发表的第一篇文章，所以请记住，如果我的要求实在太多了，如果我的问题太具体、太宽泛，而不是集中在以下问题上，我应该考虑一次仔细地解决这个问题：一个问题，需要更多的信息，否则，请毫不犹豫地评论）

在 C++ 中，我目前正在以某种方式不使用正则表达式标头进行 Thompson 的构造> 我焦急地不知道如何继续为什么，

假设我有以下字符串表达式，其中 '.'是连接，'|'是 Union，'*' 是 KleeneStar，我们可以假设语法已经事先检查过，并且它们可以读作：

a 或

(a)|(b) 或

(a).(b) 或

((a)|(b)).(c) 或

((c)|(d)) .((a)|(b)) 或

(a)* 或

((a)*)|((b)*) 或

这里的目标是提取、调用或读取要放置的字符串表达式输出一组字符串作为其相应表达式的预期输出（注意空字符串可以输出为“_”），其中：

a 输出 {a}

< code>(a)|(b) 输出 {a, b}

(a).(b) 输出 {ab}

((a)|(b)).(c) 输出 {ac, bc}

((c)|(d)).((a)|(b)) 输出 {ca, da, cb , db}

(a)* 输出{_, a, aa, aaa, aaaa, ...}（我们可以考虑在这里设置限制）

((a)*)|(( b)*) 输出 {_, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, ...} （我们可以考虑在这里设置一个限制）

我确信有更好的方法可以使用算法、伪代码、解析等来预先做到这一点，或者我缺乏实践使我在编码知识上存在严重差距执行这样的事情。

到目前为止，我认真地想知道许多因素，例如要创建哪些函数，要进行哪些递归调用，指针，堆栈，向量，何时何地等，并且这样的过程可能会给程序带来真正的压力计算我所实现的所有可能性，例如，匹配用户的输入字符串以匹配列出的字符串集。

也许我在论坛上浏览过的一篇帖子没有完全阅读到足以理解。我对自己的观点完全不知所措，希望能一步一步地了解如何做，非常感谢进一步的指导和帮助。任何解决方案甚至您自己发布的代码的完整演练将对我的学习有很大帮助。任何建议都会很有帮助。再次强调，如果我忘记提及任何关键术语或任何具体内容，请发表您的评论，感谢您的宝贵时间。

编辑：有人告诉我，我的建议可能与 KleeneStar 运算符非常相关，并且生成输出可能会变得不平凡？我想知道是否可以设置 int 限制，或者仍然可能严重导致更多并发症。

原文

(This is my first post here so do bear in mind if what I'm asking is seriously too much and that I should consider tackling this piece carefully one at a time, if my question is too specific, too broad, not focused down to a single problem, requiring significantly more info, or otherwise, please don't hesitate to comment down)

In C++, I'm currently in the process of doing Thompson's Construction in a certain way without using the regex header and I'm anxiously stuck on how to proceed and why,

Say I have the following String expression in which '.' is Concatenation, '|' is Union, and '*' is KleeneStar, we can assume the syntax is already checked beforehand, and that they can be read as:

a or

(a)|(b) or

(a).(b) or

((a)|(b)).(c) or

((c)|(d)).((a)|(b)) or

(a)* or

((a)*)|((b)*) or

The goal here is to extract, call, or read the string expression that would lay out a set of strings as its expected output resulting from their corresponding expression (note an empty string could be output as '_') in which:

a outputs {a}

(a)|(b) outputs {a, b}

(a).(b) outputs {ab}

((a)|(b)).(c) outputs {ac, bc}

((c)|(d)).((a)|(b)) outputs {ca, da, cb, db}

(a)* outputs {_, a, aa, aaa, aaaa, ...} (we can consider putting a limit here)

((a)*)|((b)*) outputs {_, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, ...} (we can consider putting a limit here)

I'm sure there's a better way to do this beforehand using algorithms, pseudo-code, parsing, and more, or that my lack of practice leaves me serious gaps in my coding knowledge to perform such a thing.

As of now, I'm seriously wondering a number of factors such as what functions to make, what recursion calls to make, pointers, stacks, vectors, where and when, etc., and that such a process can perhaps really stress the program to calculate ALL possibilities for whatever I'm achieving, say, matching an input string from a user to match the set of strings listed.

and that perhaps one of the posts I've looked over in forums wasn't fully read over enough to understand. I'm completely overwhelmed in my perspective and would like a walkthrough on what to do one step at a time, further guidance and assistance are seriously greatly appreciated. A complete walkthrough for any solutions or even your own codes posted would be of great help for me to learn. Any suggestions would really help. Again any critical terms that I've forgotten to mention or anything specific please post your comments, Thank you for your time.

EDIT:
I've been told that what I'm suggesting can be quite concerning with the KleeneStar operator and that generating the output would may become nontrivial? I was wondering if setting an int limit would be possible or that could still seriously cause even more complications.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

贪了杯 2025-01-18 07:07:36

我们可以使用 Thompson 的构造将正则表达式转换为具有 ε 转移的非确定性有限自动机，然后在图中运行 Dijkstra 算法，其中节点是 (q, x) 对，其中 q 是状态，x 是字符串，并且弧从 (q, x) 到 (δ(q, σ), x σ)，其中 σ 是符号（权重 1，双端队列的后面）或 ε（权重 0，双端队列的前面）双端队列）。通过按此顺序枚举，我们可以轻松添加大小限制。

我们用递归下降来解析正则表达式。我不打算进一步解释；那里有一百万个计算器教程。

不要在下面的生产级 C++ 中运行此代码。

#include <deque>
#include <iostream>
#include <tuple>
#include <unordered_set>
#include <vector>

struct State;

struct Transition {
  char symbol;
  State *successor;
};

struct State {
  std::vector<State *> epsilon_transitions;
  std::vector<Transition> transitions;
  std::unordered_set<std::string> visited_strings;
};

struct Automaton {
  State *initial_state;
  State *final_state;
};

Automaton Eps() {
  Automaton c;
  c.initial_state = c.final_state = new State;
  return c;
}

Automaton Sym(char symbol) {
  Automaton c;
  c.initial_state = new State;
  c.final_state = new State;
  c.initial_state->transitions.push_back({symbol, c.final_state});
  return c;
}

Automaton Star(Automaton a) {
  Automaton c;
  c.initial_state = c.final_state = new State;
  c.final_state->epsilon_transitions.push_back(a.initial_state);
  a.final_state->epsilon_transitions.push_back(c.final_state);
  return c;
}

Automaton Cat(Automaton a, Automaton b) {
  Automaton c;
  c.initial_state = a.initial_state;
  c.final_state = b.final_state;
  a.final_state->epsilon_transitions.push_back(b.initial_state);
  return c;
}

Automaton Alt(Automaton a, Automaton b) {
  Automaton c;
  c.initial_state = new State;
  c.final_state = new State;
  c.initial_state->epsilon_transitions.push_back(a.initial_state);
  c.initial_state->epsilon_transitions.push_back(b.initial_state);
  a.final_state->epsilon_transitions.push_back(c.final_state);
  b.final_state->epsilon_transitions.push_back(c.final_state);
  return c;
}

void Enumerate(Automaton a, std::size_t size_limit = 4) {
  std::deque<std::tuple<State *, std::string>> queue = {
      {a.initial_state, std::string()}};
  do {
    auto [state, string] = queue.front();
    if (string.size() > size_limit) {
      std::cout << "...\n";
      break;
    }
    queue.pop_front();
    if (!state->visited_strings.insert(string).second) {
      continue;
    }
    if (state == a.final_state) {
      std::cout << "\"" << string << "\"\n";
    }
    for (State *successor : state->epsilon_transitions) {
      queue.push_front({successor, string});
    }
    for (auto [symbol, successor] : state->transitions) {
      queue.push_back({successor, string + std::string(1, symbol)});
    }
  } while (!queue.empty());
  std::cout << "\n";
}

#include <cctype>
#include <cstdio>
#include <istream>
#include <optional>
#include <sstream>

char Peek(std::istream &input) {
  input >> std::ws;
  return input.peek();
}

bool Match(std::istream &input, char c) {
  if (Peek(input) == c) {
    input.get();
    return true;
  }
  return false;
}

std::optional<char> ParseSymbol(std::istream &input) {
  if (std::isalpha(Peek(input))) {
    return input.get();
  }
  return std::nullopt;
}

std::optional<Automaton> Parse(std::istream &input);

std::optional<Automaton> ParsePrimitive(std::istream &input) {
  if (auto maybe_symbol = ParseSymbol(input)) {
    return Sym(maybe_symbol.value());
  }
  if (Match(input, '(')) {
    if (auto maybe_a = Parse(input)) {
      if (Match(input, ')')) {
        return maybe_a.value();
      }
    }
  }
  return std::nullopt;
}

std::optional<Automaton> ParseFactor(std::istream &input) {
  if (auto maybe_a = ParsePrimitive(input)) {
    if (Match(input, '*')) {
      return Star(*maybe_a);
    }
    return *maybe_a;
  }
  return std::nullopt;
}

std::optional<Automaton> ParseTerm(std::istream &input) {
  if (auto maybe_a = ParseFactor(input)) {
    Automaton a = maybe_a.value();
    while (Match(input, '.')) {
      if (auto maybe_b = ParseFactor(input)) {
        a = Cat(a, maybe_b.value());
      } else {
        return std::nullopt;
      }
    }
    return a;
  }
  return std::nullopt;
}

std::optional<Automaton> Parse(std::istream &input) {
  if (auto maybe_a = ParseTerm(input)) {
    Automaton a = maybe_a.value();
    while (Match(input, '|')) {
      if (auto maybe_b = ParseTerm(input)) {
        a = Alt(a, maybe_b.value());
      } else {
        return std::nullopt;
      }
    }
    return a;
  }
  return std::nullopt;
}

std::optional<Automaton> Parse(const std::string &input) {
  std::stringstream stream(input);
  if (auto maybe_a = Parse(stream)) {
    if (Peek(stream) == EOF) {
      return maybe_a.value();
    }
  }
  return std::nullopt;
}

int main() {
  Enumerate(Parse("a").value());
  Enumerate(Parse("a|b").value());
  Enumerate(Parse("a.b").value());
  Enumerate(Parse("(a|b).c").value());
  Enumerate(Parse("(c|d).(a|b)").value());
  Enumerate(Parse("a*").value());
  Enumerate(Parse("(a|b)*").value());
}

输出：

"a"

"b"
"a"

"ab"

"bc"
"ac"

"db"
"da"
"cb"
"ca"

""
"a"
"aa"
"aaa"
"aaaa"
...

""
"b"
"a"
"bb"
"ba"
"ab"
"aa"
"bbb"
"bba"
"bab"
"baa"
"abb"
"aba"
"aab"
"aaa"
"bbbb"
"bbba"
"bbab"
"bbaa"
"babb"
"baba"
"baab"
"baaa"
"abbb"
"abba"
"abab"
"abaa"
"aabb"
"aaba"
"aaab"
"aaaa"
...

We can use Thompson's construction to convert the regular expression into a nondeterministic finite automaton with ε-transitions, then run Dijkstra's algorithm in a graph where the nodes are (q, x) pairs where q is a state and x is a string, and the arcs are from (q, x) to (δ(q, σ), x σ) where σ is either a symbol (weight 1, back of the double-ended queue) or ε (weight 0, front of the double-ended queue). By enumerating in this order, we can easily add a size limit.

We parse the regular expression with recursive descent. I'm not going to explain further; there are a million calculator tutorials out there.

Don't-run-this-in-production-grade C++ below.

#include <deque>
#include <iostream>
#include <tuple>
#include <unordered_set>
#include <vector>

struct State;

struct Transition {
  char symbol;
  State *successor;
};

struct State {
  std::vector<State *> epsilon_transitions;
  std::vector<Transition> transitions;
  std::unordered_set<std::string> visited_strings;
};

struct Automaton {
  State *initial_state;
  State *final_state;
};

Automaton Eps() {
  Automaton c;
  c.initial_state = c.final_state = new State;
  return c;
}

Automaton Sym(char symbol) {
  Automaton c;
  c.initial_state = new State;
  c.final_state = new State;
  c.initial_state->transitions.push_back({symbol, c.final_state});
  return c;
}

Automaton Star(Automaton a) {
  Automaton c;
  c.initial_state = c.final_state = new State;
  c.final_state->epsilon_transitions.push_back(a.initial_state);
  a.final_state->epsilon_transitions.push_back(c.final_state);
  return c;
}

Automaton Cat(Automaton a, Automaton b) {
  Automaton c;
  c.initial_state = a.initial_state;
  c.final_state = b.final_state;
  a.final_state->epsilon_transitions.push_back(b.initial_state);
  return c;
}

Automaton Alt(Automaton a, Automaton b) {
  Automaton c;
  c.initial_state = new State;
  c.final_state = new State;
  c.initial_state->epsilon_transitions.push_back(a.initial_state);
  c.initial_state->epsilon_transitions.push_back(b.initial_state);
  a.final_state->epsilon_transitions.push_back(c.final_state);
  b.final_state->epsilon_transitions.push_back(c.final_state);
  return c;
}

void Enumerate(Automaton a, std::size_t size_limit = 4) {
  std::deque<std::tuple<State *, std::string>> queue = {
      {a.initial_state, std::string()}};
  do {
    auto [state, string] = queue.front();
    if (string.size() > size_limit) {
      std::cout << "...\n";
      break;
    }
    queue.pop_front();
    if (!state->visited_strings.insert(string).second) {
      continue;
    }
    if (state == a.final_state) {
      std::cout << "\"" << string << "\"\n";
    }
    for (State *successor : state->epsilon_transitions) {
      queue.push_front({successor, string});
    }
    for (auto [symbol, successor] : state->transitions) {
      queue.push_back({successor, string + std::string(1, symbol)});
    }
  } while (!queue.empty());
  std::cout << "\n";
}

#include <cctype>
#include <cstdio>
#include <istream>
#include <optional>
#include <sstream>

char Peek(std::istream &input) {
  input >> std::ws;
  return input.peek();
}

bool Match(std::istream &input, char c) {
  if (Peek(input) == c) {
    input.get();
    return true;
  }
  return false;
}

std::optional<char> ParseSymbol(std::istream &input) {
  if (std::isalpha(Peek(input))) {
    return input.get();
  }
  return std::nullopt;
}

std::optional<Automaton> Parse(std::istream &input);

std::optional<Automaton> ParsePrimitive(std::istream &input) {
  if (auto maybe_symbol = ParseSymbol(input)) {
    return Sym(maybe_symbol.value());
  }
  if (Match(input, '(')) {
    if (auto maybe_a = Parse(input)) {
      if (Match(input, ')')) {
        return maybe_a.value();
      }
    }
  }
  return std::nullopt;
}

std::optional<Automaton> ParseFactor(std::istream &input) {
  if (auto maybe_a = ParsePrimitive(input)) {
    if (Match(input, '*')) {
      return Star(*maybe_a);
    }
    return *maybe_a;
  }
  return std::nullopt;
}

std::optional<Automaton> ParseTerm(std::istream &input) {
  if (auto maybe_a = ParseFactor(input)) {
    Automaton a = maybe_a.value();
    while (Match(input, '.')) {
      if (auto maybe_b = ParseFactor(input)) {
        a = Cat(a, maybe_b.value());
      } else {
        return std::nullopt;
      }
    }
    return a;
  }
  return std::nullopt;
}

std::optional<Automaton> Parse(std::istream &input) {
  if (auto maybe_a = ParseTerm(input)) {
    Automaton a = maybe_a.value();
    while (Match(input, '|')) {
      if (auto maybe_b = ParseTerm(input)) {
        a = Alt(a, maybe_b.value());
      } else {
        return std::nullopt;
      }
    }
    return a;
  }
  return std::nullopt;
}

std::optional<Automaton> Parse(const std::string &input) {
  std::stringstream stream(input);
  if (auto maybe_a = Parse(stream)) {
    if (Peek(stream) == EOF) {
      return maybe_a.value();
    }
  }
  return std::nullopt;
}

int main() {
  Enumerate(Parse("a").value());
  Enumerate(Parse("a|b").value());
  Enumerate(Parse("a.b").value());
  Enumerate(Parse("(a|b).c").value());
  Enumerate(Parse("(c|d).(a|b)").value());
  Enumerate(Parse("a*").value());
  Enumerate(Parse("(a|b)*").value());
}

Output:

"a"

"b"
"a"

"ab"

"bc"
"ac"

"db"
"da"
"cb"
"ca"

""
"a"
"aa"
"aaa"
"aaaa"
...

""
"b"
"a"
"bb"
"ba"
"ab"
"aa"
"bbb"
"bba"
"bab"
"baa"
"abb"
"aba"
"aab"
"aaa"
"bbbb"
"bbba"
"bbab"
"bbaa"
"babb"
"baba"
"baab"
"baaa"
"abbb"
"abba"
"abab"
"abaa"
"aabb"
"aaba"
"aaab"
"aaaa"
...

回复收藏 0 原文

~没有更多了~