What is the best way to parse a body of text against multiple (15) regexes on each line?

Posted 2024-07-09 00:24:22


I have a body of text that I have to scan, and each line contains at least two and sometimes four parts of information. The problem is that each line can be one of 15-20 different actions.

In Ruby, the current code looks somewhat like this:

text.split("\n").each do |line|  #around 20 times..

..............

      expressions['actions'].each do |pat, reg| #around 20 times

.................

This is obviously 'THE PROBLEM'. I did manage to make it faster (in C++, by a 50% margin) by combining all the regexen into one, but that is still not the speed I require -- I need to parse thousands of these files FAST!

Right now I match them with regexes -- however, this is intolerably slow. I started with Ruby and hopped over to C++ in hopes that I'd get a speed boost, and it just isn't happening.

I've casually read up on PEGs and grammar-based parsing, but it looks somewhat difficult to implement. Is this the direction I should head, or are there different routes?

Basically, I'm parsing poker hand histories, and each line of a hand history usually contains 2-3 bits of information that I need to collect: who the player was, how much money was involved, or which cards the action entailed, etc.

Sample text that needs to be parsed:

buriedtens posts $5
The button is in seat #4
*** HOLE CARDS ***
Dealt to Mayhem 31337 [8s Ad]
Sherwin7 folds
OneMiKeee folds
syhg99 calls $5
buriedtens raises to $10

After I collect this information, each action is turned into an XML node.
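
A minimal sketch of that step with Ruby's bundled REXML library (the element and attribute names below are assumed placeholders, not the asker's actual schema):

require 'rexml/document'

# Append one <action> element per parsed line; the values come from the
# sample text above, the names are placeholders.
doc = REXML::Document.new('<hand/>')
doc.root.add_element('action',
  'player' => 'buriedtens', 'type' => 'raise', 'amount' => '10')
doc.write($stdout, 2)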

Right now my Ruby implementation of this is much faster than my C++ one, but that's probably just because I have not written C code in well over 4-5 years.

UPDATE:
I don't want to post all the code here but so far my hands/second look like the following:

588 hands/second -- boost::spirit in c++
60 hands/second -- 1 very long and complicated regex in c++ (all the regexen put together)
33 hands/second -- normal regex style in ruby

I'm currently testing ANTLR to see if we can go any further, but as of right now I'm very, very happy with Spirit's results.

Related question: Efficiently querying one string against multiple regexes.


Comments (10)

壹場煙雨 2024-07-16 00:24:23


Here is one way of doing it, if you were using Perl.
Copied from perldoc perlfaq6:

while (<>) {
    chomp;
    PARSER: {
        m/ \G( \d+\b    )/gcx   && do { print "number: $1\n";  redo; };
        m/ \G( \w+      )/gcx   && do { print "word:   $1\n";  redo; };
        m/ \G( \s+      )/gcx   && do { print "space:  $1\n";  redo; };
        m/ \G( [^\w\d]+ )/gcx   && do { print "other:  $1\n";  redo; };
    }
}

For each line, the PARSER loop first tries to match a series of digits followed by a word boundary. This match has to start at the place the last match left off (or the beginning of the string on the first match). Since m/ \G( \d+\b )/gcx uses the c flag, if the string does not match that regular expression, Perl does not reset pos(), and the next match starts at the same position to try a different pattern.
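
For reference, here is a rough Ruby counterpart of the same resume-where-you-left-off scanning, using the standard library's StringScanner (this translation is an addition, not part of the original answer):

require 'strscan'

# Each scan resumes where the previous match ended, like Perl's \G with /gc.
ARGF.each_line do |line|
  s = StringScanner.new(line.chomp)
  until s.eos?
    if    (t = s.scan(/\d+\b/))    then puts "number: #{t}"
    elsif (t = s.scan(/\w+/))      then puts "word:   #{t}"
    elsif (t = s.scan(/\s+/))      then puts "space:  #{t}"
    elsif (t = s.scan(/[^\w\s]+/)) then puts "other:  #{t}"
    end
  end
end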

烟织青萝梦 2024-07-16 00:24:23


See Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...). Depending on the volume of your data and how complex your regexes are, it might just be faster to write your own parsing logic.
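
As one sketch of what "your own parsing logic" could look like here, a Ruby version that dispatches on plain string operations instead of regexes (the three actions are assumptions taken from the question's sample lines):

# Hand-rolled dispatch on substring positions; returns nil for unknown lines.
def parse_line(line)
  case
  when line.end_with?(' folds')
    { action: :fold, player: line[0...-6] }
  when (i = line.index(' calls $'))
    { action: :call, player: line[0...i], amount: line[(i + 8)..].to_f }
  when (i = line.index(' raises to $'))
    { action: :raise, player: line[0...i], amount: line[(i + 12)..].to_f }
  end
end

parse_line('syhg99 calls $5')
# => {:action=>:call, :player=>"syhg99", :amount=>5.0}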

沙沙粒小 2024-07-16 00:24:23


I've casually read up on PEGs and grammar-based parsing, but it looks somewhat difficult to implement. Is this the direction I should head, or are there different routes?

Personally, I've grown to love PEGs. It will perhaps take a bit to get comfortable with them; however, I think they're so much more maintainable that it's a clear win. I find parsing code is the source of a lot of unexpected bugs as you find new edge cases in inputs. Declarative grammars with nonterminals are easier to update when this happens, compared to loop- and condition-heavy regex code. Naming is powerful.

In Ruby there's Treetop, which is a parser generator that uses PEGs. I recently found it quite pleasant to replace a regex-heavy hand-written parser with a short grammar.
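
For a concrete feel, a tiny sketch with the treetop gem (the grammar and rule names are invented; only two of the sample actions are covered, and player names are treated as single tokens, so multi-word names like "Mayhem 31337" would need a negative-lookahead rule):

require 'treetop'

# Compile the grammar at runtime; this generates a HandActionParser class.
Treetop.load_from_string(<<~'GRAMMAR')
  grammar HandAction
    rule action
      fold / call
    end

    rule fold
      player:name ' folds'
    end

    rule call
      player:name ' calls $' amount:([0-9.]+)
    end

    rule name
      [\w]+
    end
  end
GRAMMAR

parser = HandActionParser.new
result = parser.parse('syhg99 calls $5')
puts result ? 'parsed' : parser.failure_reason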

岁月蹉跎了容颜 2024-07-16 00:24:23


Do the regular expression matches ever overlap? That is, when two or more regexes match the same line, do they always match different parts of the line (no overlap)?

If the matches never overlap, run your search using one regular expression that combines the 15 regexes you have now:

regex1|regex2|regex3|...|regex15

Use capturing groups if you need to be able to determine which of the 15 regexes matched.

Searching your data once for a long regex will be faster than searching it 15 times. How much faster depends on the regex engine you're using and the complexity of your regular expressions.
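
A Ruby sketch of that combination using named captures (the three branches are stand-ins modeled on the question's sample lines, not the actual 15 patterns):

# One pass per line; whichever named group matched tells you the action.
ACTION_RE = /\A(?:
    (?<fold_player>.+)\ folds
  | (?<call_player>.+)\ calls\ \$(?<call_amt>[\d.]+)
  | (?<raise_player>.+)\ raises\ to\ \$(?<raise_amt>[\d.]+)
)\z/x

def classify(line)
  m = ACTION_RE.match(line) or return nil
  if    m[:fold_player] then [:fold,  m[:fold_player]]
  elsif m[:call_player] then [:call,  m[:call_player],  m[:call_amt].to_f]
  else                       [:raise, m[:raise_player], m[:raise_amt].to_f]
  end
end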

萌梦深 2024-07-16 00:24:23


Try a simple test in Perl. Read about the "study" function. What I might try is:

  • Read the entire file (or a large batch of lines, if the files are very large) into a single string
  • Add a line number to the beginning of each line as you go.
  • "study" the string. This builds a lookup table by character, can be large.
  • Run regular expression matches on the string, bounded by newlines (use the m and s regex modifiers). The expression should extract the line number along with the data.
  • Set an array item indexed by line number to the data found on that line, or do something even smarter.
  • Finally you can process the data stored in the array.

I have not tried it, but it might be interesting.
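
Ruby has no study equivalent, but the slurp-number-and-scan part of this idea might look like the following sketch (the file name and the single folds pattern are assumptions):

# Number every line, join into one big string, then extract with a single
# multiline scan instead of a per-line loop.
text     = File.read('hand_history.txt')
numbered = text.each_line.with_index(1).map { |l, n| "#{n}:#{l}" }.join
folds    = numbered.scan(/^(\d+):(.+) folds$/)
# => [["5", "Sherwin7"], ["6", "OneMiKeee"], ...] for the sample text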

山川志 2024-07-16 00:24:23


Another idea, if you have a spiffy quad- or oct-core server to use for this.

Build a processing pipeline that divides out the work. Stage One could cut the files into one game or hand each, then write each one to one of eight Stage Two pipes, which read the data, process it, and produce output somehow, probably to a database on another machine.

In my experience, these pipe-based multi-process designs are nearly as fast as, and much easier to debug than, multi-threading designs. It would also be easy to set up a cluster of machines using network sockets instead of pipes.
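
A minimal Ruby sketch of the fork-and-pipe fan-out (the parse_hand helper and the histories/ glob are hypothetical):

# Eight worker processes, each fed file names through its own pipe.
workers = 8.times.map do
  r, w = IO.pipe
  pid = fork do
    w.close
    r.each_line { |path| parse_hand(path.chomp) }  # hypothetical helper
  end
  r.close
  [pid, w]
end

Dir.glob('histories/*.txt').each_with_index do |path, i|
  workers[i % workers.size].last.puts(path)        # round-robin distribution
end

workers.each { |pid, w| w.close; Process.wait(pid) }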

缱绻入梦 2024-07-16 00:24:23


OK, this makes things clearer (poker hand histories). I guess that you are making a statistics tool (aggression factor, went to showdown, voluntarily put $ into pot, etc.). I am not sure why you need excessive speeds for that; even if you are multitabling with 16 tables, the hands should only trickle in at a moderate rate.

I don't know Ruby, but in Perl I'd do a little switch statement, getting the significant parts into $1, $2, etc. at the same time. In my experience, this is not slower than making string comparisons and then splitting the line by other means.

HAND_LINE: for ($Line)
    { /^\*\*\* ([A-Z ]+)/ and do 
        { # parse the string that is captured in $1
          last HAND_LINE; };
      /^Dealt to (.+) \[(.. ..)\]$/ and do
        { # $1 contains the name, $2 contains the cards as string
          last HAND_LINE; };
      /(.+) folds$/ and do
        { # you get the drift
          last HAND_LINE; }; };
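
The same dispatch in Ruby would look roughly like the following (same three patterns, assuming line is already chomped; extending it to the remaining actions is mechanical):

case line
when /\A\*\*\* ([A-Z ]+)/            then phase = $1           # e.g. "HOLE CARDS"
when /\ADealt to (.+) \[(.. ..)\]\z/ then hero, cards = $1, $2
when /(.+) folds\z/                  then folder = $1
end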

I do not think that you can really make it faster. Put the checks for the lines that occur most often in the first positions (likely the fold statements) and those that occur only sparsely last (starting a new hand, "*** NEXT PHASE ***").

If you find out that the actual file reading is a bottleneck, you can perhaps take a look at what modules you can use to address large files; for Perl, Tie::File comes to mind.

Make sure that you read each hand only once. Do not read all data again after each hand, instead keep e.g. a hash table of the hand IDs already parsed.

风蛊 2024-07-16 00:24:23


For a problem like this, I'd just close my eyes and use a lexer+parser generator. You can probably beat that with hand-optimization, but it is much easier to use a generator. Also, it is way more flexible when the input suddenly changes.

走走停停 2024-07-16 00:24:22


I would suggest

Good luck

一页 2024-07-16 00:24:22


Boost.Spirit is a fantastic library that allows you to do detailed parser analysis, and since the parser is generated and compiled right into your code, it should be much faster than a dynamically-calculated solution. The syntax is mostly done with expression templates (a fancy term for lots of overloaded operators), which means you write them right into your code.
