Regex to match . (a period marking the end of a sentence) but not Mr. (as in Mr. Hopkins)

Posted on 2024-09-04 04:50:32

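I'm trying to parse a text file into sentences that end in periods, but names like Mr. Hopkins trigger false alarms when I match on periods.

What regex identifies "." but not "Mr."?

As a bonus, I'm also using ! to find the end of sentences, so my current regex is /(!/./ and I'd love an answer that incorporates my ! too.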

勿忘初心 2024-09-11 04:50:32

Use a negative lookbehind.

(?<!Mr|Mrs|Dr|Ms)\.

This matches a period only if it does not come directly after Mr, Mrs, Dr or Ms.

<?php
   $str = "This is Mr. Someone and Mrs. Somebody. They are here to meet Dr. SomeoneElse.";
   // replace every period that is not preceded by a title with a newline
   $str = preg_replace("/(?<!Mr|Mrs|Dr|Ms)\\./", "\n", $str);
   echo($str);
?>
//outputs:
This is Mr. Someone and Mrs. Somebody
 They are here to meet Dr. SomeoneElse
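
For readers outside PHP, here is a minimal Python sketch of the same idea, extended to ! and ? as the question asks. Python's re module rejects a lookbehind whose alternatives differ in length, so the single (?<!Mr|Mrs|Dr|Ms) is written as one lookbehind per title. The function name split_sentences and the TITLES tuple are illustrative assumptions, not part of the answer above.

```python
import re

# Titles whose trailing period should not end a sentence (illustrative list).
TITLES = ("Mr", "Mrs", "Dr", "Ms")

# One fixed-width negative lookbehind per title, followed by the terminator.
# Python's re needs each lookbehind to have a fixed width, hence the chain.
_GUARDS = "".join(rf"(?<!{title})" for title in TITLES)
SENTENCE_END = re.compile(_GUARDS + r"[.!?]")


def split_sentences(text):
    """Split text after each ., ! or ? that does not follow a listed title."""
    sentences = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    # Keep any trailing text that has no terminator.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = ("This is Mr. Someone and Mrs. Somebody. "
              "They are here to meet Dr. SomeoneElse! Really?")
    for sentence in split_sentences(sample):
        print(sentence)
```

Unlike the preg_replace version, this keeps the terminator attached to each sentence. Any abbreviation missing from TITLES would still be treated as a sentence end.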

〃安静 2024-09-11 04:50:32

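This cannot be done with any simple mechanism. It is hopelessly ambiguous. A sentence can end with an abbreviation, and in that case it is not written with two periods.

See Unicode TR29. See also the ICU open source library, which includes a basic implementation.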
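
A short Python sketch of why the ambiguity bites. It assumes an abbreviation guard built from lookbehinds; the pattern and both sample strings are illustrative. Once etc is protected the way a title would be, a sentence that genuinely ends in "etc." is no longer split.

```python
import re

# Protect "Mr" and "etc" from ending a sentence (illustrative, not exhaustive).
GUARDED_PERIOD = re.compile(r"(?<!Mr)(?<!etc)\.\s+")

mid_sentence = "Bring pens, paper, etc. to the exam. Doors open at nine."
end_of_sentence = "Bring pens, paper, etc. Doors open at nine."

# "etc." sits mid-sentence here, so protecting it gives the intended split.
print(GUARDED_PERIOD.split(mid_sentence))

# "etc." also ends the sentence here, but the same rule hides the boundary.
print(GUARDED_PERIOD.split(end_of_sentence))
```

The first call yields two pieces and the second yields one, even though both inputs contain two sentences.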

痴意少年 2024-09-11 04:50:32

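Are your sentences always followed by two spaces? If so, you could check for...

/\.\s{2}/

and to incorporate other sentence-ending punctuation:
/[\.\!\?]\s{2}/

You could also check for other things that might indicate the end of a sentence, such as whether the next word is capitalized or what follows it, but at best you can only make an educated guess. As pointed out above, periods are too ambiguous.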
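
A minimal Python sketch of the two-space check, assuming the text really does follow that typing convention. The lookbehind is an addition of this sketch so that the punctuation stays with its sentence, and the sample strings are illustrative.

```python
import re

# A terminator followed by two whitespace characters. The lookbehind keeps
# the punctuation attached to the sentence it ends.
TWO_SPACES = re.compile(r"(?<=[.!?])\s{2}")

double_spaced = "He paid Mr. Hopkins in full.  Was it enough?  It was!"
single_spaced = "He paid Mr. Hopkins in full. Was it enough? It was!"

# "Mr. Hopkins" has a single space, so it is left alone.
print(TWO_SPACES.split(double_spaced))

# With single spacing the convention is absent, so nothing is split.
print(TWO_SPACES.split(single_spaced))
```

Note that \s matches any whitespace, so a period followed by a space and a newline would also count as a boundary here.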

岁月静好 2024-09-11 04:50:32

The regex (?<=[\.\!\?]\s[A-Z]) almost works in testing, but it sadly leaves the capital letter at the end of the previous piece. A fix is to take that letter off the end of the previous piece and put it back at the start of the sentence it belongs to.

Example:

using System;
using System.Text.RegularExpressions;

// the string
string s = "The fox jumps over the dog. The dog jumps over the fox.";

// Regex.Split takes the input first, then the pattern
string[] lines = Regex.Split(s, @"(?<=[\.\!\?]\s[A-Z])");

// join the pieces for printing; Console.WriteLine(lines) would only print the type name
Console.WriteLine(string.Join("|", lines));

The resulting array would be: ["The fox jumps over the dog. T","he dog jumps over the fox."]

To fix this:

// make sure there is a split
if (lines.Length > 1)
{
    // stop one short of the end: the last piece has no sentence after it
    for (int i = 0; i < lines.Length - 1; i++)
    {
        // store the misplaced letter (the last character of this piece)
        char misplacedLetter = lines[i][lines[i].Length - 1];

        // remove the letter
        lines[i] = lines[i].Substring(0, lines[i].Length - 1);

        // place it on the front of the next sentence
        lines[i + 1] = misplacedLetter + lines[i + 1];
    }
}

This worked well for me. (You may choose to cache lines[i] instead of accessing it over and over.)
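
An alternative sketch in Python that sidesteps the repair loop: test for the capital letter in a lookahead instead of inside the lookbehind, so the letter is checked but never consumed. This is a swapped-in variant rather than the approach described above, and the names BOUNDARY and split_on_capital are illustrative.

```python
import re

# The lookbehind checks for a terminator, \s+ consumes the gap, and the
# lookahead checks for a capital letter without consuming it.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_on_capital(text):
    """Split where a terminator is followed by whitespace and a capital letter."""
    return BOUNDARY.split(text)


if __name__ == "__main__":
    s = "The fox jumps over the dog. The dog jumps over the fox."
    print(split_on_capital(s))
```

This heuristic still splits after "Mr." in "Mr. Hopkins", because the name starts with a capital letter, so it would need to be combined with a title check such as a negative lookbehind.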
