Regex for PHP: search for a word and return the data after it
I'm trying to write a regex for some work I've been asked to do, but I'm having no luck making it efficient enough.
The objective is to make the following as efficient as it can be.
Objective 1: Split all the text at the sentence endings (dot, three dots, exclamation point...).
Objective 2: Get all the numbers that appear after the string 'em'.
Here's an example of a possible small string and a regex for it (the real one can be really huge).
The regex:
old:(?:[^.!?:]|...)(?:(?:[^.!?:]|...)*?em (\d+))*
new:(?:[.!?]|[.][.][.])(?:(?:[^.!?]|[.][.][.])*?\bem\b (\d+))*
It works for this string (I just made it up; I inserted the . at the beginning):
.Foi visto que a batalha em 1939 foi. Claro que a data que digo ser em 1939 é uma farsa. Em 1938 já (insert em 1910) não havia reis.
What I want is a regex that does not backtrack, since it simply does not need to. By doing that I think I could cut the processing time, say from 30 seconds to 20 s or even 10 s! Just for this one string, it takes 1 s to complete.
Add:
Thanks for the answers; now I have one that does not fail. But it still backtracks too much. Any solutions?
Add (to answer one deleted question):
Unfortunately I have no sample data. The person who asked me to do this says he does not have sample data either, yet this needs to be done "by yesterday". If you give me something that works with this text as efficiently as possible, I'm certain I can work with it and convert it, if needed, into something specific for this work. Otherwise I'll ask here again.
Comments (2)
Although the question is confusing, it sounds like you have two different tasks, which are best accomplished with two different regexes. Here is a tested script that does what (I'm guessing) you want:
Here is the output from the script.
There were 4 numbers following "em".
Number[1] = 1939
Number[2] = 1939
Number[3] = 1938
Number[4] = 1910
There were 3 sentences found.
Sentence[1] = "Foi visto que a batalha em 1939 foi."
Sentence[2] = "Claro que a data que digo ser em 1939 é uma farsa."
Sentence[3] = "Em 1938 já (insert em 1910) não havia reis."
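The two-regex approach described above can be sketched roughly as follows. This is a minimal illustration, not the answerer's script: the patterns, the variable names `$numbers` and `$sentences`, and the choice to split on whitespace after a terminator are all assumptions.

```php
<?php
// Sample text from the question (without the leading dot the asker added).
$text = 'Foi visto que a batalha em 1939 foi. '
      . 'Claro que a data que digo ser em 1939 é uma farsa. '
      . 'Em 1938 já (insert em 1910) não havia reis.';

// Task 1 (assumed pattern): digits that follow the whole word "em".
// /i also accepts "Em"; /u treats the subject as UTF-8.
preg_match_all('/\bem\s+(\d+)/iu', $text, $m);
$numbers = $m[1];

// Task 2 (assumed pattern): split on whitespace that follows ., ! or ?.
// The one-character lookbehind keeps the terminator with its sentence,
// and "..." is handled because the split happens after the last dot.
$sentences = preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

printf("There were %d numbers following \"em\".\n", count($numbers));
foreach ($numbers as $i => $n) {
    printf("Number[%d] = %s\n", $i + 1, $n);
}
printf("There were %d sentences found.\n", count($sentences));
foreach ($sentences as $i => $s) {
    printf("Sentence[%d] = \"%s\"\n", $i + 1, $s);
}
```

Keeping the two tasks in separate patterns means neither regex has to carry a nested repetition, which is where most of the backtracking in the question's pattern comes from.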
I won't answer about performance, but:
EDIT:
About performance: matching anything (.) inside a repetition block forces the engine to go back and forth for quite a while. If you can match explicit patterns instead, it will generally be much quicker.
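As a hedged illustration of that advice, the sketch below replaces "anything" with explicit character classes and uses PCRE possessive quantifiers (`++`), which stop the engine from giving characters back once a run has matched. The patterns and the names `$sentences` and `$numbers` are assumptions for illustration, not code from this answer.

```php
<?php
$text = 'Foi visto que a batalha em 1939 foi. '
      . 'Claro que a data que digo ser em 1939 é uma farsa. '
      . 'Em 1938 já (insert em 1910) não havia reis.';

// A sentence is a run of non-terminator characters followed by a run of
// terminators. Both runs are possessive, so once matched they are never
// re-split by backtracking. A run such as "..." is consumed as one unit.
preg_match_all('/[^.!?]++[.!?]++/u', $text, $m);
$sentences = array_map('trim', $m[0]);

// The number search names exactly what it wants: the word "em",
// whitespace, then digits. No dot, no lazy quantifier.
preg_match_all('/\bem\s++(\d++)/iu', $text, $n);
$numbers = $n[1];

print_r($sentences);
print_r($numbers);
```

Whether this is measurably faster on the real input would need timing on that data; the point is only that each position in the text is consumed by exactly one alternative-free run.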