How can I use a regular expression to match the English words in an English article? Thanks ^_^

Posted on 2022-09-12 23:43:01 · 944 characters · 19 views · 0 comments

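Problem description

Requirement: write a Java program that tallies how many times each English word occurs in an article. When deciding what counts as a word, take into account the surrounding spaces, line-break characters, and the hyphen "-"; a hyphen joins its parts into a single word. Implement this with regular expressions, following these rules:

  1. Treat each of the following as one word:
    don't, doesn't, didn't, can't, couldn't, wouldn't, isn't, aren't, wasn't, weren't
  2. Treat each of the following as one word as well:
    he's, she's, I'm, you're, we're, they're
  3. Do not count the following; remove them:
    Shawn's, apple's, Jonas’, what's, 'twas
  4. When ice-cream is not broken across the end of a line, treat it as one word, and do not remove the hyphen in the middle.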

Source of the problem and my own attempt

I read through some material and wrote a first draft:
(?:she's|he's|they're|we're|you're|I'm|It's)|(?:isn't|aren't|doesn't|don't|didn't|haven't|hadn't|hasn't|can't|couldn't|wasn't|weren't|wouldn't )

The test string is:
She's"1.tom:'what's your name.' Jame's Janes', didn't, character,wasn't,
ice-cream,

Related code

(?:she's|he's|they're|we're|you're|I'm|It's)|(?:isn't|aren't|doesn't|don't|didn't|haven't|hadn't|hasn't|can't|couldn't|wasn't|weren't|wouldn't )

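What result do you expect? What error do you actually see?

The draft above cannot correctly recognize ordinary words, hyphens, or line breaks.

Thanks in advance to any veteran who can point the way and help me design this regular expression ^_^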


Comments (2)

怪我入戏太深 2022-09-19 23:43:01

This basically meets your requirements:

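    public static int countWordsUsingRegex(String arg) {
        if (arg == null) {
            return 0;
        }

        // Split on runs of punctuation or whitespace, except ' and -,
        // and also on the possessive forms that must not be counted.
        // Backslashes are doubled because this is a Java string literal.
        // Hyphen followed by a line break: adjust "-\n" yourself as needed.
        final String[] words = arg.split("[\\p{Punct}|\\s&&[^']&&[^-]]+|\\s+Shawn's\\s+|\\s+apple's\\s+|\\s+Jonas'\\s+|\\s+what's\\s+|\\s+'twas\\s+|'s\\s+|-\\n");

        for (String word : words) {
            System.out.println(word);
        }

        // Total number of tokens, not a per-word count
        return words.length;
    }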
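
The method above returns only the total number of tokens, while the question asks for a count per word. A minimal sketch of that last step might look like the following. The class name `TokenTally` and the method `tally` are hypothetical, and lowercasing the tokens before counting is an assumption the question does not state.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: turns the tokens produced by a split() call
// into a per-word frequency table.
class TokenTally {
    static Map<String, Integer> tally(String[] words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            // split() can produce an empty first token when the text
            // starts with a delimiter, so skip empty entries
            if (word == null || word.isEmpty()) {
                continue;
            }
            // Assumption: "Didn't" and "didn't" count as the same word
            counts.merge(word.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }
}
```

Passing the `words` array from the method above to `TokenTally.tally(words)` would give one entry per distinct word, sorted alphabetically because of the `TreeMap`.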

一萌ing 2022-09-19 23:43:01

Would writing it this way work? It is a bit more cumbersome, but the accuracy should be fairly high.

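    // Fragment: tokenizer1 and tokenizer2 (java.util.StringTokenizer) and the
    // String variables word and word2 are assumed to be declared earlier.
    while (tokenizer1.hasMoreTokens()) {
        // "'" is not used as a delimiter here, so that contractions stay in one token
        word = tokenizer1.nextToken(" ,?.!:;\"`()[]\n").toLowerCase();
        if (word.matches("[a-z]+-")) { // the word ends with a hyphen at the end of a line
            word2 = tokenizer2.nextToken(" ,?.!:;'-"); // take the rest of the word from the start of the next line
            word += word2;
        } else if (word.contains("'")) {
            if (word.matches("isn't|aren't|doesn't|don't|didn't|haven't|hadn't|hasn't|can't|couldn't|wasn't|weren't|wouldn't")) {
                // known contraction: keep it as a single word
            }
        } else if (word.length() <= 1) {
            word = "";
        }
    }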
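
For comparison, the same rules can be sketched with `java.util.regex.Matcher` instead of tokenizers. This is only a sketch under stated assumptions, not a definitive solution: the class name `WordFrequency` is hypothetical, words are lowercased before counting, and a word broken as "ice-" at the end of one line and "cream" at the start of the next is rejoined as "ice-cream". Any token containing an apostrophe that is not on the keep lists from rules 1 and 2 is dropped, which also drops an ordinary word that directly touches a single quotation mark.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the counting rules from the question.
class WordFrequency {
    // Contractions from rules 1 and 2, lowercased; each counts as one word.
    private static final Set<String> KEPT = new HashSet<>(Arrays.asList(
        "don't", "doesn't", "didn't", "can't", "couldn't", "wouldn't",
        "isn't", "aren't", "wasn't", "weren't",
        "he's", "she's", "i'm", "you're", "we're", "they're"));

    // Optional leading apostrophe ('twas), letters, optional hyphenated
    // parts (ice-cream), optional apostrophe suffix (didn't, Jonas').
    // \u2019 is the typographic apostrophe.
    private static final Pattern TOKEN = Pattern.compile(
        "['\u2019]?[A-Za-z]+(?:-[A-Za-z]+)*(?:['\u2019][A-Za-z]*)?");

    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        if (text == null) {
            return counts;
        }
        // Assumption: a hyphen directly before a line break continues the word
        String joined = text.replaceAll("-\\r?\\n", "-");
        Matcher m = TOKEN.matcher(joined);
        while (m.find()) {
            String word = m.group().replace('\u2019', '\'').toLowerCase(Locale.ROOT);
            // Rule 3: apostrophe forms outside the keep lists are not counted
            if (word.indexOf('\'') >= 0 && !KEPT.contains(word)) {
                continue;
            }
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }
}
```

Run on the test string from the question, this sketch counts she's, tom, your, name, didn't, character, wasn't and ice-cream once each, and skips what's, Jame's and Janes'.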