如何使用扫描仪分隔符（包括 Java 中的单引号或撇号）过滤掉文本文件中的非字母

发布于 2024-10-01 13:17:21 字数 922 浏览 0 评论 0原文

请对文件中的每个单词进行计数，并且此计数不应包括撇号、逗号、句号、问号、感叹号等非字母，即字母表中的字母。我尝试使用这样的分隔符，但它不包含撇号。

Scanner fileScanner = new Scanner("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt");
    int totalWordCount = 0;

    //Firstly to count all the words in the file without the restricted characters 
    while (fileScanner.hasNext()) {
        fileScanner.useDelimiter(("[.,:;()?!\" \t\n\r]+")).next();
        totalWordCount++;
    }
    System.out.println("There are " + totalWordCount + " word(s)");

  //Then later I create an array to store each individual word in the file for counting their lengths.
    Scanner fileScanner2 = new Scanner("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt");
    String[] words = new String[totalWordCount];
    for (int i = 0; i < totalWordCount; ++i) {
        words[i] = fileScanner2.useDelimiter(("[.,:;()?!\" \t\n\r]+")).next();
    }

这似乎不起作用！

请问我该怎么办？

原文

Pls I want to keep a count of every word from a file, and this count should not include non letters like the apostrophe, comma, fullstop, question mark, exclamation mark, e.t.c. i.e just letters of the alphabet.
I tried to use a delimiter like this, but it didn't include the apostrophe.

Scanner fileScanner = new Scanner("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt");
    int totalWordCount = 0;

    //Firstly to count all the words in the file without the restricted characters 
    while (fileScanner.hasNext()) {
        fileScanner.useDelimiter(("[.,:;()?!\" \t\n\r]+")).next();
        totalWordCount++;
    }
    System.out.println("There are " + totalWordCount + " word(s)");

  //Then later I create an array to store each individual word in the file for counting their lengths.
    Scanner fileScanner2 = new Scanner("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt");
    String[] words = new String[totalWordCount];
    for (int i = 0; i < totalWordCount; ++i) {
        words[i] = fileScanner2.useDelimiter(("[.,:;()?!\" \t\n\r]+")).next();
    }

This doesn't seem to work !

Please how can I go about this ?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

旧瑾黎汐 2024-10-08 13:17:21

在我看来，除了空格和结束线之外，您不想使用任何内容进行过滤。例如，如果您使用 ' 来过滤单词数，则单词“they're”将作为两个单词返回。以下是您可以更改原始代码以使其正常工作的方法。

Scanner fileScanner = new Scanner(new File("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt"));
    int totalWordCount = 0;
    ArrayList<String> words = new ArrayList<String>();

    //Firstly to count all the words in the file without the restricted characters 
    while (fileScanner.hasNext()) {
        //Add words to an array list so you only have to go through the scanner once
        words.add(fileScanner.next());//This defaults to whitespace
        totalWordCount++;
    }
    System.out.println("There are " + totalWordCount + " word(s)");
    fileScanner.close();

使用 Pattern.compile() 将字符串转换为正则表达式。 '\s' 字符是在 Pattern 类中预定义的，用于匹配所有空白字符。

有更多信息，请访问
模式文档

另外，请确保在完成后关闭您的扫描仪类。这可能会阻止您的第二个扫描仪打开。

编辑

如果您想计算每个单词的字母数，您可以将以下代码添加到上面的代码中，

int totalLetters = 0;
int[] lettersPerWord = new int[words.size()];
for (int wordNum = 0; wordNum < words.size(); wordNum++)
{
 String word = words.get(wordNum);
 word = word.replaceAll("[.,:;()?!\" \t\n\r\']+", "");
 lettersPerWord[wordNum] = word.length();
 totalLetters = word.length();
}

我已经测试了此代码，它似乎对我有用。 replaceAll，根据JavaDoc 使用正则表达式进行匹配，因此它应该匹配任何这些字符并从本质上删除它。

Seems to me that you don't want to filter using anything but spaces and end lines. For example the word "they're" would return as two words if you're using a ' to filter your number of words. Here's how you could change your original code to make it work.

Scanner fileScanner = new Scanner(new File("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt"));
    int totalWordCount = 0;
    ArrayList<String> words = new ArrayList<String>();

    //Firstly to count all the words in the file without the restricted characters 
    while (fileScanner.hasNext()) {
        //Add words to an array list so you only have to go through the scanner once
        words.add(fileScanner.next());//This defaults to whitespace
        totalWordCount++;
    }
    System.out.println("There are " + totalWordCount + " word(s)");
    fileScanner.close();

Using the Pattern.compile() turns your string into a regular expression. The '\s' character is predefined in the Pattern class to match all white space characters.

There is more information at
Pattern Documentation

Also, make sure to close your Scanner classes when you're done. This could prevent your second scanner from opening.

Edit

If you want to count the letters per word you can add the following code to the above code

int totalLetters = 0;
int[] lettersPerWord = new int[words.size()];
for (int wordNum = 0; wordNum < words.size(); wordNum++)
{
 String word = words.get(wordNum);
 word = word.replaceAll("[.,:;()?!\" \t\n\r\']+", "");
 lettersPerWord[wordNum] = word.length();
 totalLetters = word.length();
}

I have tested this code and it appears to work for me. The replaceAll, according to the JavaDoc uses a regular expression to match so it should match any of those characters and essentially remove it.

回复收藏 0 原文

我爱人 2024-10-08 13:17:21

分隔符不是正则表达式，因此在您的示例中，它正在查找 "[.,:;()?!\" \t\n\r]+" 之间分割的内容

您可以使用 regexp 而不是

使用分隔符带有 group 方法的 regexp 类可能就是您所寻找的，

String pattern = "(.*)[.,:;()?!\" \t\n\r]+(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(test);
    if (m.find( )) {
        System.out.println("Found value: " + m.group(1) );
    }

尝试一下这些类，您会发现它与您需要的更加相似。

The Delimiter is not a regular expression, so with your example it is looking for things split between "[.,:;()?!\" \t\n\r]+"

You can either use regexp instead of the Delimiter

using the regexp class with the group method may be what your looking for.

String pattern = "(.*)[.,:;()?!\" \t\n\r]+(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(test);
    if (m.find( )) {
        System.out.println("Found value: " + m.group(1) );
    }

Play with those classes and you will see it is much more similar to what you need

回复收藏 0 原文