How to filter out non-letters from a text file using a Scanner delimiter (including the single quote or apostrophe) in Java

Posted on 2024-10-01 13:17:21

I want to count every word in a file, and the count should not include non-letter characters such as the apostrophe, comma, full stop, question mark, exclamation mark, etc.; that is, only letters of the alphabet should count.
I tried to use a delimiter like this, but it doesn't include the apostrophe.

Scanner fileScanner = new Scanner("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt");
int totalWordCount = 0;

// Firstly to count all the words in the file without the restricted characters
while (fileScanner.hasNext()) {
    fileScanner.useDelimiter(("[.,:;()?!\" \t\n\r]+")).next();
    totalWordCount++;
}
System.out.println("There are " + totalWordCount + " word(s)");

// Then later I create an array to store each individual word in the file for counting their lengths.
Scanner fileScanner2 = new Scanner("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt");
String[] words = new String[totalWordCount];
for (int i = 0; i < totalWordCount; ++i) {
    words[i] = fileScanner2.useDelimiter(("[.,:;()?!\" \t\n\r]+")).next();
}

This doesn't seem to work!

How can I go about this?

Comments (3)

旧瑾黎汐 2024-10-08 13:17:21

It seems to me that you don't want to split on anything but spaces and line endings. For example, the word "they're" would be returned as two words if you used ' as a delimiter. Here's how you could change your original code to make it work.

import java.io.File;
import java.util.ArrayList;
import java.util.Scanner;

// new File(...) makes the Scanner read the file's contents rather than the path string
Scanner fileScanner = new Scanner(new File("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt"));
int totalWordCount = 0;
ArrayList<String> words = new ArrayList<String>();

// Count the whitespace-separated words in the file
while (fileScanner.hasNext()) {
    // Add words to an array list so you only have to go through the scanner once
    words.add(fileScanner.next()); // next() splits on whitespace by default
    totalWordCount++;
}
System.out.println("There are " + totalWordCount + " word(s)");
fileScanner.close();

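Pattern.compile() turns a string into a regular expression, and Scanner.useDelimiter(String) compiles its argument the same way. The '\s' character class is predefined in the Pattern class to match all whitespace characters. Scanner's default delimiter is whitespace, so next() needs no custom delimiter here.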

There is more information in the Pattern documentation.

Also, make sure to close your Scanner instances when you're done. Leaving the first one open could prevent your second scanner from opening the file.
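One detail worth spelling out: new Scanner(String) tokenizes the characters of the string itself, while new Scanner(new File(path)) reads the file at that path. The sketch below illustrates the difference and uses try-with-resources so the scanner is always closed. The class name WordReader, its method names, and the sample path are hypothetical and only serve the illustration.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class WordReader {
    // Reads whitespace-separated tokens from the scanner and closes it afterwards.
    static List<String> readWords(Scanner scanner) {
        List<String> words = new ArrayList<>();
        try (Scanner in = scanner) {      // try-with-resources closes the scanner
            while (in.hasNext()) {
                words.add(in.next());     // default delimiter is whitespace
            }
        }
        return words;
    }

    // Scans the contents of the file at the given path.
    static List<String> readWordsFromFile(String path) throws FileNotFoundException {
        return readWords(new Scanner(new File(path)));
    }

    // Scans the characters of the string itself, NOT a file with that name.
    static List<String> readWordsFromString(String text) {
        return readWords(new Scanner(text));
    }

    public static void main(String[] args) {
        // A path passed as a plain String is tokenized as text (hypothetical path).
        System.out.println(readWordsFromString("C:\\some\\hypothetical\\file.txt"));
        System.out.println(readWordsFromString("they're here, aren't they?"));
    }
}
```

With the String constructor, a path containing no whitespace comes back as a single token, which would explain a word count of 1 regardless of what the file contains.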

Edit

If you want to count the letters per word, you can add the following code to the code above:

int totalLetters = 0;
int[] lettersPerWord = new int[words.size()];
for (int wordNum = 0; wordNum < words.size(); wordNum++)
{
    String word = words.get(wordNum);
    // Strip punctuation, whitespace and apostrophes so only letters remain
    word = word.replaceAll("[.,:;()?!\" \t\n\r']+", "");
    lettersPerWord[wordNum] = word.length();
    // Accumulate the running total instead of overwriting it
    totalLetters += word.length();
}

I have tested this code and it appears to work for me. According to the JavaDoc, replaceAll uses a regular expression to match, so it matches any of those characters and removes them.
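As a variation on the same idea, the character class can be inverted so that everything except letters is removed, which avoids having to list each punctuation mark. This is a minimal sketch that assumes only the ASCII letters a-z and A-Z count as letters; the class name LetterCounter and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class LetterCounter {
    // Returns the number of ASCII letters in each word, ignoring everything else.
    static int[] lettersPerWord(List<String> words) {
        int[] counts = new int[words.size()];
        for (int i = 0; i < words.size(); i++) {
            // [^a-zA-Z] matches any single character that is not an ASCII letter
            counts[i] = words.get(i).replaceAll("[^a-zA-Z]", "").length();
        }
        return counts;
    }

    // Sums the per-word counts.
    static int totalLetters(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] counts = lettersPerWord(Arrays.asList("they're", "here,", "done!"));
        System.out.println(Arrays.toString(counts));
        System.out.println(totalLetters(counts));
    }
}
```

A token made up only of punctuation ends up with a count of 0, so you may want to skip such tokens when counting words.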

我爱人 2024-10-08 13:17:21

Scanner compiles the delimiter string into a regular expression, so with your example it splits the input on any run of the characters in "[.,:;()?!\" \t\n\r]+". The apostrophe is not in that set, so it stays inside the words.

Instead of a delimiter, you can also use the regex classes directly.

Using Pattern and Matcher with the group method may be what you're looking for.

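import java.util.regex.Matcher;
import java.util.regex.Pattern;

// test is the text to search; a sample value is assumed here
String test = "Hello, world!";
// (.*) is greedy, so group(1) holds everything before the last run of delimiters
String pattern = "(.*)[.,:;()?!\" \t\n\r]+(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(test);
if (m.find()) {
    System.out.println("Found value: " + m.group(1));
}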

Experiment with those classes and you will see they come much closer to what you need.
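One way to apply Pattern and Matcher to this problem is to match the words themselves rather than the separators, calling find() in a loop. The sketch below assumes a word is any run of ASCII letters and apostrophes; the class name WordMatcher and the pattern are illustrative choices, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatcher {
    // A word is assumed to be one or more ASCII letters or apostrophes.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z']+");

    // Collects every match of the word pattern, in order of appearance.
    static List<String> findWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group()); // group() returns the text of the current match
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("Hello, world! They're here."));
    }
}
```

Note that a stray apostrophe standing on its own would also be matched as a word under this pattern, so real input may need an extra check.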

谜兔 2024-10-08 13:17:21

You could try this regex in your delimiter:
fileScanner.useDelimiter("[^a-zA-Z']+").next();

This uses any run of characters that are neither letters nor apostrophes as a delimiter. That way your words will include the apostrophe but not any other non-letter character.

Then you'll have to loop through each word, check for apostrophes, and account for them if you want the length to be accurate. You could simply remove each apostrophe so that the length matches the number of letters in the word, or you could create word objects with their own length fields, so that you can print the word as is and still know how many letters it contains.
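The word-object idea could be sketched as follows, assuming the delimiter "[^a-zA-Z']+" so that apostrophes stay inside the tokens. The names ApostropheWords, Word, text, and letterCount are hypothetical, and the scanner reads from a String here only to keep the example self-contained; for a file you would construct it from a File instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ApostropheWords {
    // Keeps the word as written plus the number of letters without apostrophes.
    static final class Word {
        final String text;
        final int letterCount;

        Word(String text) {
            this.text = text;
            this.letterCount = text.replace("'", "").length();
        }
    }

    // Splits the input on runs of characters that are neither letters nor apostrophes.
    static List<Word> scan(String input) {
        List<Word> words = new ArrayList<>();
        try (Scanner in = new Scanner(input)) {
            in.useDelimiter("[^a-zA-Z']+");
            while (in.hasNext()) {
                words.add(new Word(in.next()));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        for (Word w : scan("Don't panic, they're fine!")) {
            System.out.println(w.text + " has " + w.letterCount + " letter(s)");
        }
    }
}
```

Storing the count alongside the text means the word can still be printed with its apostrophe while the length reflects letters only.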
