Regex to match . (a period marking the end of a sentence) but not Mr. (as in Mr. Hopkins)

Posted on 2024-09-04 04:50:32

I'm trying to parse a text file into sentences ending in periods, but names like Mr. Hopkins trigger false matches on the period.

What regex matches "." but not the period in "Mr."?

As a bonus, I'm also using ! to find the ends of sentences, so my current regex is /(!/./ and I'd love an answer that incorporates my !'s too.

Comments (4)

勿忘初心 2024-09-11 04:50:32

Use a negative lookbehind.

(?<!Mr|Mrs|Dr|Ms)\.

This matches a period only if it does not come directly after Mr, Mrs, Dr or Ms.

<?php
   $str = "This is Mr. Someone and Mrs. Somebody. They are here to meet Dr. SomeoneElse.";
   // replace every period that is not preceded by a title with a newline
   $str = preg_replace("/(?<!Mr|Mrs|Dr|Ms)\\./", "\n", $str);
   echo($str);
?>
//outputs:
This is Mr. Someone and Mrs. Somebody
 They are here to meet Dr. SomeoneElse
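
The same idea can be extended to cover ! as the question asks. Below is a minimal Python sketch, not the answerer's code: the names ABBREVIATIONS, SENTENCE_END and split_sentences are made up for illustration. Python's re module only accepts fixed-width lookbehinds, so this sketch assumes one lookbehind per title instead of a single alternation, and adds \b so that only whole words count as titles.

```python
import re

# Hypothetical list of titles whose trailing period should not end a sentence.
ABBREVIATIONS = ("Mr", "Mrs", "Dr", "Ms")

# One fixed-width negative lookbehind per title, chained together.
_LOOKBEHINDS = "".join(rf"(?<!\b{re.escape(a)})" for a in ABBREVIATIONS)

# A sentence end is "." or "!" that is not directly preceded by a title.
SENTENCE_END = re.compile(_LOOKBEHINDS + r"[.!]")


def split_sentences(text):
    """Split text at sentence-ending punctuation, dropping the punctuation."""
    parts = SENTENCE_END.split(text)
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    sample = ("This is Mr. Someone and Mrs. Somebody. "
              "They are here to meet Dr. SomeoneElse! Done.")
    for sentence in split_sentences(sample):
        print(sentence)
```

Like the PHP version, this discards the terminating punctuation, and it only knows about the titles in the list.
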
〃安静 2024-09-11 04:50:32

This can't be done with any simple mechanism. It's hopelessly ambiguous. Sentences can end with abbreviations, and in those cases they aren't written with two periods.

See Unicode TR29. Also see the ICU open source library, which includes a basic implementation.
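
To make the ambiguity concrete, here is a small Python sketch. The rule and the name count_sentence_ends are hypothetical, chosen only for this illustration: a period counts as a sentence end unless it follows "Mr" or "etc". The sample text has two sentences, but the first one ends with an abbreviation, so the rule finds only one sentence end.

```python
import re

# Hypothetical rule: a period ends a sentence unless it follows "Mr" or "etc".
NOT_AFTER_ABBREVIATION = re.compile(r"(?<!\bMr)(?<!\betc)\.")


def count_sentence_ends(text):
    """Count the periods that the rule above treats as sentence ends."""
    return len(NOT_AFTER_ABBREVIATION.findall(text))


if __name__ == "__main__":
    # Two sentences; the first ends with "etc." and is written with one period.
    text = "We sell pens, paper, etc. Mr. Hopkins buys them."
    print(count_sentence_ends(text))
```

Dropping "etc" from the rule would fix this sentence but break any text where "etc." appears mid-sentence, which is the point the answer makes.
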

痴意少年 2024-09-11 04:50:32

Are your sentences always followed by two spaces? If so you could just check for that...

/\.\s{2}/

And to incorporate other end-of-sentence punctuation:
/[\.\!\?]\s{2}/

You could also check for other indicators of the end of a sentence, such as whether the next word is capitalized or whether the punctuation is followed by a carriage return. But at best you'll only be able to make an educated guess; as pointed out above, the period is just too ambiguous.
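
As a rough illustration of the two-space check, here is a Python sketch. The names TWO_SPACE_END and split_on_double_space are invented for this example, and the lookbehind is an addition (not in the pattern above) so that the punctuation stays attached to its sentence.

```python
import re

# Split on exactly two whitespace characters that follow ".", "!" or "?".
# The lookbehind keeps the punctuation with the sentence it ends.
TWO_SPACE_END = re.compile(r"(?<=[.!?])\s{2}")


def split_on_double_space(text):
    """Split text wherever sentence punctuation is followed by two spaces."""
    return [s.strip() for s in TWO_SPACE_END.split(text) if s.strip()]


if __name__ == "__main__":
    sample = "Hello Mr. Hopkins.  How are you!  Fine."
    for sentence in split_on_double_space(sample):
        print(sentence)
```

"Mr. Hopkins" survives here only because it is followed by a single space; the approach depends entirely on the text following the two-space convention.
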

岁月静好 2024-09-11 04:50:32

The regex (?<=[\.\!\?]\s[A-Z]) almost works when tested, but sadly it leaves the capital letter at the end of the previous match. One fix is to take that letter off the end of the previous piece and add it back to the front of the next one.

Example:

//the string
string s = "The fox jumps over the dog. The dog jumps over the fox.";

//split right after "<punctuation><whitespace><capital letter>"
string[] lines = Regex.Split(s, @"(?<=[\.\!\?]\s[A-Z])");

Console.WriteLine(string.Join(",", lines));

The resulting array would be: ["The fox jumps over the dog. T","he dog jumps over the fox."]

To fix this:

            //make sure there is a split
            if (lines.Length > 1)
            {
                //stop one short of the end: the last piece has no next sentence
                for (int i = 0; i < lines.Length - 1; i++)
                {
                    //store letter (Last() requires using System.Linq)
                    char misplacedLetter = lines[i].TrimEnd().Last();

                    //remove letter
                    lines[i] = lines[i].Substring(0, lines[i].Length - 1);

                    //place on front of next sentence.
                    lines[i + 1] = misplacedLetter + lines[i + 1];
                }
            }

This worked well for me. (You may choose to cache lines[i] instead of accessing it over and over.)
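
An alternative that avoids the repair loop is to check for the capital letter with a lookahead, so that it is never consumed by the split. This is a different technique from the one in the answer, sketched here in Python with invented names (CAPITAL_AFTER_END, split_before_capital):

```python
import re

# Split on the whitespace between sentence punctuation and a capital letter.
# Lookbehind and lookahead are zero-width, so only the whitespace is removed.
CAPITAL_AFTER_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_before_capital(text):
    """Split text where ., ! or ? is followed by whitespace and a capital."""
    return CAPITAL_AFTER_END.split(text)


if __name__ == "__main__":
    s = "The fox jumps over the dog. The dog jumps over the fox."
    for sentence in split_before_capital(s):
        print(sentence)
```

Note that this heuristic alone still splits "Mr. Hopkins" after "Mr.", because the title is followed by a capital letter; it would need to be combined with an abbreviation check.
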
