How to filter out non-letters from a text file using a Scanner delimiter (including the single quote or apostrophe) in Java
I want to keep a count of every word in a file, and this count should not include non-letters such as the apostrophe, comma, full stop, question mark, exclamation mark, etc., i.e. just the letters of the alphabet.
I tried to use a delimiter like this, but it doesn't include the apostrophe.
Scanner fileScanner = new Scanner("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt");
int totalWordCount = 0;
// First, count all the words in the file, excluding the restricted characters
while (fileScanner.hasNext()) {
    fileScanner.useDelimiter("[.,:;()?!\" \t\n\r]+").next();
    totalWordCount++;
}
System.out.println("There are " + totalWordCount + " word(s)");
// Later, I create an array to store each individual word in the file so I can count their lengths
Scanner fileScanner2 = new Scanner("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt");
String[] words = new String[totalWordCount];
for (int i = 0; i < totalWordCount; ++i) {
    words[i] = fileScanner2.useDelimiter("[.,:;()?!\" \t\n\r]+").next();
}
This doesn't seem to work!
How can I go about this?
Comments (3)
It seems to me that you don't want to split on anything but spaces and line endings. For example, the word "they're" would be returned as two words if you used ' as a delimiter. You could change your original code to use a whitespace-only delimiter built with Pattern.compile(), which turns your string into a regular expression. The '\s' character class is predefined in the Pattern class and matches all whitespace characters. There is more information in the Pattern documentation.
Also, make sure to close your Scanner objects when you're done. Leaving the first one open could prevent your second scanner from opening.
Edit
If you want to count the letters per word, you can strip the non-letter characters from each word with replaceAll before taking its length.
I have tested this approach and it appears to work for me. According to the JavaDoc, replaceAll uses a regular expression to match, so it should match any of those characters and essentially remove them.
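The code from the first answer did not survive on this page, so what follows is only a rough sketch of the whitespace-only approach it describes, not the answerer's original code. The class name `WhitespaceWordCount` and the helper names `words` and `letterCount` are made up for illustration. One thing worth noting: `new Scanner(String)` scans the characters of the string itself, so passing a path as a `String` tokenizes the path rather than the file. The sketch therefore passes a `File` instead, using a temporary file so that it can run anywhere.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class WhitespaceWordCount {
    // \s+ matches one or more whitespace characters (spaces, tabs, line breaks).
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Reads every whitespace-separated token from the scanner.
    // The caller owns the scanner and is responsible for closing it.
    static List<String> words(Scanner scanner) {
        scanner.useDelimiter(WHITESPACE);
        List<String> result = new ArrayList<>();
        while (scanner.hasNext()) {
            result.add(scanner.next());
        }
        return result;
    }

    // Counts only the letters a-z and A-Z in a token by deleting everything else.
    static int letterCount(String token) {
        return token.replaceAll("[^a-zA-Z]", "").length();
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the real text file.
        Path file = Files.createTempFile("words", ".txt");
        Files.write(file, "Well, they're here!\nAre you?".getBytes(StandardCharsets.UTF_8));
        // Passing a File (not a String) makes the Scanner read the file contents.
        try (Scanner scanner = new Scanner(file.toFile())) {
            List<String> tokens = words(scanner);
            System.out.println("There are " + tokens.size() + " word(s)");
            for (String token : tokens) {
                System.out.println(token + " -> " + letterCount(token) + " letter(s)");
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Because the tokens are collected in a single pass, a second scanner and a pre-sized array are not needed. A token made up only of punctuation (for example "--") would still be counted as a word with zero letters here, so it may be worth skipping tokens whose letter count is 0.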
Scanner.useDelimiter(String) does treat its argument as a regular expression, so in your example the scanner splits on runs of the characters listed in "[.,:;()?!\" \t\n\r]+"; the apostrophe is simply not among them.
Instead of a delimiter, you can also use the regex classes directly.
Pattern and Matcher, together with the group method, may be what you're looking for.
Play with those classes and you will see that they are much closer to what you need.
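As a minimal sketch of the `Pattern`/`Matcher` route with `group()`, the example below matches the words themselves instead of describing what separates them. The class name `RegexWordFinder` and the particular word pattern (letters, with apostrophes allowed only between letters) are assumptions chosen for illustration and are not taken from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class RegexWordFinder {
    // One or more letters, optionally followed by groups of
    // "apostrophe + letters", so "they're" is one word but a
    // leading or trailing quote mark is not part of the word.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+(?:'[a-zA-Z]+)*");

    // Returns every substring of the text that matches the word pattern.
    static List<String> findWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group()); // group() is the text of the current match
        }
        return words;
    }

    public static void main(String[] args) {
        for (String word : findWords("Don't stop, they're 'here'!")) {
            System.out.println(word);
        }
    }
}
```

Matching words directly avoids having to list every possible punctuation character in a delimiter, at the cost of deciding up front what counts as a word.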
You could try this regex as your delimiter:
fileScanner.useDelimiter("[^a-zA-Z']+").next();
This treats any run of characters that are neither letters nor apostrophes as a delimiter. That way your words will include apostrophes but no other non-letter characters.
Then, if you want the lengths to be accurate, you'll have to loop through each word, check for apostrophes, and account for them. You could simply remove each apostrophe so that the length matches the number of letters in the word, or you could create word objects with their own length fields, so that you can print the word as is and still know how many letters it contains.
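The "word object with its own length field" idea could look roughly like the sketch below. It assumes a delimiter of `[^a-zA-Z']+` (any run of characters that are neither letters nor apostrophes); the names `ApostropheAwareWords`, `Word`, and `parse` are invented for illustration. It scans a `String` to stay self-contained; for a real file, a `File` would be passed to the `Scanner` constructor instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class, for illustration only.
final class ApostropheAwareWords {

    // Keeps the word as written plus the number of letters it contains.
    static final class Word {
        final String text;
        final int letters;

        Word(String text) {
            this.text = text;
            // Tokens contain only letters and apostrophes, so removing
            // the apostrophes leaves exactly the letters.
            this.letters = text.replace("'", "").length();
        }
    }

    static List<Word> parse(String input) {
        List<Word> words = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            // Split on runs of anything that is not a letter or an apostrophe.
            scanner.useDelimiter("[^a-zA-Z']+");
            while (scanner.hasNext()) {
                Word word = new Word(scanner.next());
                if (word.letters > 0) { // skip tokens made up only of apostrophes
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        for (Word word : parse("They're here, aren't they?")) {
            System.out.println(word.text + " has " + word.letters + " letter(s)");
        }
    }
}
```

Storing the letter count next to the original text means the word can still be printed with its apostrophe, while any length statistics use the letters only.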