Regex for PHP: search for a word and return the data after it

Posted on 2024-11-04 02:30:49

I'm trying to write a regex for a job I've been asked to do, but I'm having no luck making it efficient enough.
The objective is to make the following as efficient as possible.
Objective 1: split the text at sentence endings (dot, three dots, exclamation point...).
Objective 2: get all the numbers that appear after the string 'em'.
Here's an example of a possible small string and a regex for it (the real one can be really huge).
The regex:
old:
(?:[^.!?:]|...)(?:(?:[^.!?:]|...)*?em (\d+))*
new:
(?:[.!?]|[.][.][.])(?:(?:[^.!?]|[.][.][.])*?\bem\b (\d+))*

It works for the following string (I just made it up).
(I insert the . at the beginning.)

.Foi visto que a batalha em 1939 foi. Claro que a data que digo ser em 1939 é uma farsa. Em 1938 já (insert em 1910) não havia reis.

What I want is a regex that does not backtrack, as it simply does not need to. I think that would save processing time, for example reducing 30 seconds to 20 s or even 10 s! This one alone takes 1 s to complete.
Add:
Thanks for the answers; now I have one that does not fail. But it still backtracks too much. Any solutions?

Add (to answer a deleted question):
Unfortunately I have no sample data. The person who asked me to do this says he does not have sample data either, and it still needs to be done "by yesterday". If you give me something that handles this text as efficiently as possible, I'm certain I can work with it and convert it, if needed, into something specific for this job. Otherwise I'll ask here again.

Comments (2)

何时共饮酒 2024-11-11 02:30:49

Although the question is confusing, it sounds like you have two different tasks, which are best accomplished with two different regexes. Here is a tested script that does what (I'm guessing) you want:

<?php // test.php 20110430_1100
    // Test data.
    $text = 'Foi visto que a batalha em 1939 foi. Claro'.
        ' que a data que digo ser em 1939 é uma farsa. E'.
        'm 1938 já (insert em 1910) não havia reis.';

    // Part 1: Find all numbers after "em".
    $re1 = '/\bem\b\s*(\d+)\b/i';
    $count = preg_match_all($re1, $text, $matches);
    if ($count) $numbers = $matches[1]; // Array of number strings.
    else        $numbers = array();     // Else no numbers found.

    // Part 2: Split text into sentences.
    $re2 = '/(?<=[.!?])\s+/';
    $sentences = preg_split($re2, $text, -1, PREG_SPLIT_NO_EMPTY);

    // Print out results.
    $ncnt = count($numbers); // Count of numbers found.
    printf("There were %d numbers following \"em\".\n", $ncnt);
    for ($i = 0; $i < $ncnt; ++$i) {
        printf("  Number[%d] = %s\n", $i + 1, $numbers[$i]);
    }
    $scnt = count($sentences); // Count of sentences found.
    printf("\nThere were %d sentences found.\n", $scnt);
    for ($i = 0; $i < $scnt; ++$i) {
        printf("  Sentence[%d] = \"%s\"\n", $i + 1, $sentences[$i]);
    }
?>

Here is the output from the script.

There were 4 numbers following "em".
  Number[1] = 1939
  Number[2] = 1939
  Number[3] = 1938
  Number[4] = 1910

There were 3 sentences found.
  Sentence[1] = "Foi visto que a batalha em 1939 foi."
  Sentence[2] = "Claro que a data que digo ser em 1939 é uma farsa."
  Sentence[3] = "Em 1938 já (insert em 1910) não havia reis."
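
If the numbers are needed per sentence rather than as one flat list, the two regexes from the script can be chained: split first, then search each sentence. The following is only a sketch under that assumption; `groupNumbersBySentence()` is a hypothetical helper name, not something from the question or the answer.

```php
<?php
// Hypothetical helper: returns one entry per sentence, each holding the
// sentence text and the numbers that follow the word "em" inside it.
function groupNumbersBySentence(string $text): array
{
    // Same split as Part 2 of the script: break on whitespace that
    // directly follows a sentence terminator (. ! ?).
    $sentences = preg_split('/(?<=[.!?])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $result = [];
    foreach ($sentences as $sentence) {
        // Same pattern as Part 1 of the script, applied to one sentence.
        preg_match_all('/\bem\b\s*(\d+)\b/i', $sentence, $m);
        $result[] = ['sentence' => $sentence, 'numbers' => $m[1]];
    }
    return $result;
}

$text = 'Foi visto que a batalha em 1939 foi. Claro que a data que digo ser '
      . 'em 1939 é uma farsa. Em 1938 já (insert em 1910) não havia reis.';

foreach (groupNumbersBySentence($text) as $i => $row) {
    printf("Sentence[%d]: %s\n", $i + 1, implode(', ', $row['numbers']));
}
```

For the sample text this lists 1939 for the first two sentences and 1938, 1910 for the third. Because the lookbehind only checks the last terminator character, an ellipsis followed by whitespace is also treated as a sentence end.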

作业与我同在 2024-11-11 02:30:49

I won't answer about performance, but:

  • You shouldn't use '...' to match an ellipsis but '\.\.\.' (otherwise you match any sequence of 3 characters). Note that this might greatly improve your performance.
  • I don't speak that language (Spanish), but I think you want to match only the word "em", not a word ending (e.g. balahem 1930 would match).
  • You should not assume that there is only one space between 'em' and your number: Em__1950 (replace each _ with a space) would not match.
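
The three points above can be checked with a few `preg_match` calls. This is a minimal sketch; the variable names are made up for illustration and the test strings are the ones mentioned in the list.

```php
<?php
// Point 1: unescaped dots match any three characters, escaped dots only "...".
$anyThree    = preg_match('/^...$/', 'abc');       // 1: "abc" is accepted
$literalDots = preg_match('/^\.\.\.$/', 'abc');    // 0: "abc" is rejected
$ellipsis    = preg_match('/^\.\.\.$/', '...');    // 1: a real ellipsis is accepted

// Point 2: without \b, "em" also matches the tail of a longer word.
$noBoundary   = preg_match('/em (\d+)/', 'balahem 1930');    // 1
$withBoundary = preg_match('/\bem (\d+)/', 'balahem 1930');  // 0

// Point 3: one literal space fails on two spaces; \s+ accepts any run of whitespace.
$oneSpace = preg_match('/\bem (\d+)/i', 'Em  1950');          // 0
$anySpace = preg_match('/\bem\s+(\d+)/i', 'Em  1950', $m);    // 1, $m[1] is "1950"

var_dump($anyThree, $literalDots, $ellipsis, $noBoundary, $withBoundary, $oneSpace, $anySpace);
```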

EDIT:
About performance: matching anything (.) inside a repetition block forces the engine to go back and forth for quite a while. If you can match explicit patterns, it will generally be much quicker.
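
One way to apply this, sketched here as an assumption rather than as the method either answer proposes, is to drop the nested repetition entirely and scan the text once for explicit tokens: either "em" followed by a number, or a sentence terminator. PCRE's possessive quantifiers (`++`) additionally tell the engine not to give back characters it has already consumed. The variable names are illustrative.

```php
<?php
// Each match is either "em <number>" (group 1 is set) or a terminator.
// \.{3} is listed before [.!?] so an ellipsis counts as one terminator.
$re = '/\bem\s++(\d++)|\.{3}|[.!?]/i';

$text = 'Foi visto que a batalha em 1939 foi. Claro que a data que digo ser '
      . 'em 1939 é uma farsa. Em 1938 já (insert em 1910) não havia reis.';

preg_match_all($re, $text, $tokens, PREG_SET_ORDER);

$perSentence = [];  // numbers grouped by sentence
$current     = [];  // numbers seen in the sentence currently being read
foreach ($tokens as $t) {
    if (isset($t[1]) && $t[1] !== '') {
        $current[] = $t[1];          // "em <number>" token
    } else {
        $perSentence[] = $current;   // terminator token: close the sentence
        $current = [];
    }
}
if ($current) {
    $perSentence[] = $current;       // trailing text without a terminator
}

print_r($perSentence);
```

For the sample text, `$perSentence` ends up with three entries: 1939, then 1939, then 1938 and 1910. The sketch treats every terminator as a sentence end, so consecutive marks such as "?!" or a decimal point inside a number would produce extra, possibly empty, groups; real data may need a stricter terminator rule.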
