Regex to match . (periods marking end of sentence) but not Mr. (as in Mr. Hopkins)
I'm trying to parse a text file into sentences ending in periods, but names like Mr. Hopkins are throwing false alarms on matching for periods.
What regex identifies "." but not "Mr."?
For bonus, I'm also using ! to find end of sentences, so my current Regex is /(!/./ and I'd love an answer that incorporates my !'s too.
Use negative lookbehind.
This will match a period only if it does not come after Mr, Mrs, Dr or Ms.
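A minimal sketch of such a negative lookbehind, assuming Python's re module (the names SENTENCE_END and split_sentences are made up for illustration). Python only accepts fixed-width lookbehinds, so each title gets its own assertion, and ! is added as a second alternative to cover the bonus part of the question:

```python
import re

# A period counts as a sentence end only when it is NOT preceded by one of
# the titles Mr, Mrs, Dr or Ms; an exclamation mark always counts.
# \b keeps words that merely end in those letters (e.g. "ATMs.") matchable.
SENTENCE_END = re.compile(r"(?<!\bMr)(?<!\bMrs)(?<!\bDr)(?<!\bMs)\.|!")


def split_sentences(text):
    """Cut text after every match of SENTENCE_END (illustrative helper)."""
    sentences = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        # Keep trailing text that has no closing punctuation.
        sentences.append(tail)
    return sentences


print(split_sentences("Mr. Hopkins arrived. He was late!"))
# ['Mr. Hopkins arrived.', 'He was late!']
```

The list of titles is an assumption; any abbreviation that is not listed will still be treated as a sentence end.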
This can't be done with any simple mechanism. It's hopelessly ambiguous. Sentences can end with abbreviations, and in those cases they aren't written with two periods.
See Unicode TR29 (UAX #29, Unicode Text Segmentation), which specifies sentence boundary rules. Also see the ICU open source library, which includes a basic implementation.
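To make the ambiguity concrete, here is a small sketch in Python. The rule NAIVE_END and the two sample strings are invented for illustration: the same "Dr." is a title in one string and the end of a sentence in the other, and a rule that only looks at the preceding word cannot tell them apart.

```python
import re

# Hypothetical rule: a period ends a sentence unless it follows "Dr".
NAIVE_END = re.compile(r"(?<!\bDr)\.")


def count_sentence_ends(text):
    """Count how many sentence ends the naive rule finds."""
    return len(NAIVE_END.findall(text))


title_case = "I called Dr. Hopkins."             # one sentence
abbrev_case = "I called the Dr. Hopkins came."   # two sentences: "Dr." ends the first

print(count_sentence_ends(title_case))   # 1, which is right
print(count_sentence_ends(abbrev_case))  # 1, but there are two sentences
```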
Are your sentences always followed by two spaces? If so, you could just check for that:
/\.\s{2}/
and, to incorporate other end-of-sentence punctuation:
/[\.\!\?]\s{2}/
You could also check other indicators of the end of a sentence, such as whether the next word is capitalized or whether the punctuation is followed by a carriage return. But at best you'll be making an educated guess; as pointed out above, the period is just too ambiguous.
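The two-space check can be sketched in Python as follows. This assumes the input really does separate sentences with exactly two whitespace characters; the names TWO_SPACE_BREAK and split_on_double_space are illustrative only. A lookbehind keeps the punctuation attached to its sentence while the two spaces are consumed by the split:

```python
import re

# Split on two whitespace characters that directly follow ., ! or ?.
# The lookbehind leaves the punctuation in the preceding sentence.
TWO_SPACE_BREAK = re.compile(r"(?<=[.!?])\s{2}")


def split_on_double_space(text):
    """Split text into sentences at punctuation followed by two spaces."""
    return TWO_SPACE_BREAK.split(text)


print(split_on_double_space("Mr. Hopkins arrived.  He was late!  Why?"))
# ['Mr. Hopkins arrived.', 'He was late!', 'Why?']
```

"Mr. Hopkins" survives here only because a single space follows the title, so the result is still a guess that depends on how the text was typed.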
The regex
(?<=[\.\!\?]\s[A-Z])
almost works when tested, but it sadly leaves the capital letter at the end of the previous match. A fix is to take that letter, remove it from the previous match, and add it back to the start of the match it belongs to. For example, splitting two sentences on it gives:
["The fox jumps over the dog. T","he dog jumps over the fox."]
To fix this, move that letter across for each split line. This worked well for me. (You may choose to cache lines[i] instead of accessing it over and over.)
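The split-and-repair idea can be sketched in Python as follows. This is an illustration under assumptions, not the original poster's code: BOUNDARY, split_with_stray_capital and repair are invented names, and splitting on a zero-width match requires Python 3.7 or later.

```python
import re

# Zero-width split point located just AFTER the capital letter that starts
# a new sentence, which is why that letter ends up in the previous piece.
BOUNDARY = re.compile(r"(?<=[.!?]\s[A-Z])")


def split_with_stray_capital(text):
    """Split on BOUNDARY; each piece but the last ends with a stray capital."""
    return BOUNDARY.split(text)


def repair(pieces):
    """Move the stray capital from the end of each piece to the next one."""
    lines = list(pieces)
    for i in range(len(lines) - 1):
        current = lines[i]  # cached instead of indexing lines[i] repeatedly
        lines[i] = current[:-1].rstrip()
        lines[i + 1] = current[-1] + lines[i + 1]
    return lines


text = "The fox jumps over the dog. The dog jumps over the fox."
pieces = split_with_stray_capital(text)
print(pieces)
# ['The fox jumps over the dog. T', 'he dog jumps over the fox.']
print(repair(pieces))
# ['The fox jumps over the dog.', 'The dog jumps over the fox.']
```

Like the other patterns here, this one still splits after titles such as "Mr. Hopkins", because a capital letter follows the period there too.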