Java Scanner newline parsing with regex (bug?)

Posted on 2024-09-02 12:49:51

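I'm developing a syntax analyzer by hand in Java, and I'd like to use regexes to parse the various token types. The problem is that I'd also like to be able to accurately report the current line number if the input doesn't conform to the syntax.

Long story short, I've run into a problem when I try to actually match a newline with the Scanner class. To be specific, when I try to match a newline with a pattern using the Scanner class, it fails. Almost always. But when I perform the same matching using a Matcher and the same source string, it retrieves the newline exactly as you'd expect it to. Is there a reason for this that I can't seem to discover, or is this a bug, as I suspect?

FYI: I was unable to find a bug in the Sun database that describes this issue, so if it is a bug, it hasn't been reported.

Example code: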

Pattern newLinePattern = Pattern.compile("(\\r\\n?|\\n)", Pattern.MULTILINE);
String sourceString = "\r\n\n\r\r\n\n";
Scanner scan = new Scanner(sourceString);
scan.useDelimiter("");
int count = 0;
while (scan.hasNext(newLinePattern)) {
    scan.next(newLinePattern);
    count++;
}
System.out.println("found "+count+" newlines"); // finds 7 newlines
Matcher match = newLinePattern.matcher(sourceString);
count = 0;
while (match.find()) {
    count++;
}
System.out.println("found "+count+" newlines"); // finds 5 newlines

绅士风度i 2024-09-09 12:49:51

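Your combination of useDelimiter() and next() is faulty. useDelimiter("") makes next() return 1-length substrings, because an empty string does in fact sit between every two characters.

That is, because "\r\n".equals("\r" + "" + "\n"), "\r\n" is in fact two tokens, "\r" and "\n", delimited by "".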
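
To make the tokenization concrete, here is a minimal sketch that lists the tokens a Scanner produces with an empty delimiter. The class name EmptyDelimiterDemo and the helper tokenize are made up for illustration; only Scanner, useDelimiter, hasNext and next come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class EmptyDelimiterDemo {
    // Hypothetical helper: collects every token the Scanner yields
    // when the delimiter is the empty string.
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scan = new Scanner(source)) {
            scan.useDelimiter("");
            while (scan.hasNext()) {
                tokens.add(scan.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("\r\n\n\r\r\n\n");
        // Seven tokens for this input, the same count the question reports.
        System.out.println(tokens.size() + " tokens");
        for (String t : tokens) {
            // Escape the control characters so each token is visible.
            System.out.println(t.replace("\r", "\\r").replace("\n", "\\n"));
        }
    }
}
```

Each token is a single character, so a pattern passed to hasNext(Pattern) or next(Pattern) never gets to see "\r\n" as one unit.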

To get the Matcher behavior, you need findWithinHorizon, which ignores delimiters.

    Pattern newLinePattern = Pattern.compile("(\\r\\n?|\\n)", Pattern.MULTILINE);
    String sourceString = "\r\n\n\r\r\n\n";
    Scanner scan = new Scanner(sourceString);
    int count = 0;
    while (scan.findWithinHorizon(newLinePattern, 0) != null) {
        count++;
    }
    System.out.println("found "+count+" newlines"); // finds 5 newlines

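API links

findWithinHorizon(Pattern pattern, int horizon)
Attempts to find the next occurrence of the specified pattern [...] ignoring delimiters [...] If no such pattern is detected then the null is returned [...] If horizon is 0, then [...] this method continues to search through the input looking for the specified pattern without bound.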

Related questions

Scanner method to get a char
- useDelimiter("") will tokenize into 1-length substrings

寄人书 2024-09-09 12:49:51

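This is, in fact, the expected behavior of both. The Scanner primarily cares about splitting the input into tokens using your delimiter. So it (lazily) takes your sourceString and sees it as the following sequence of tokens: \r, \n, \n, \r, \r, \n and \n. When you then call hasNext, it checks whether the next token matches your pattern (which each of them does, thanks to the ? in \r\n?). The while loop therefore iterates over each of the 7 tokens.

The Matcher, on the other hand, matches the regex greedily against the whole string, so it bundles \r\n together as you expect.

One way to highlight the Scanner's behavior is to change your regex to (\\r\\n|\\n). This results in a count of 0, because the Scanner reads the first token as \r (not \r\n), notices that it doesn't match your pattern, and so returns false when you call hasNext.

(Short version: the Scanner tokenizes using your delimiter before applying your token pattern; the Matcher doesn't do any form of tokenizing.)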
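
The effect of changing the regex can be checked with a small sketch like the one below. The class name TokenPatternDemo and the helper countTokenMatches are hypothetical; the loop itself is the same hasNext/next loop as in the question.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

public class TokenPatternDemo {
    // Hypothetical helper: counts how many leading tokens match the regex
    // when the Scanner splits the input on the empty delimiter.
    static int countTokenMatches(String source, String regex) {
        Pattern pattern = Pattern.compile(regex);
        int count = 0;
        try (Scanner scan = new Scanner(source)) {
            scan.useDelimiter("");
            while (scan.hasNext(pattern)) {
                scan.next(pattern);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String source = "\r\n\n\r\r\n\n";
        // Optional \n: every single-character token matches.
        System.out.println(countTokenMatches(source, "(\\r\\n?|\\n)")); // 7
        // Mandatory \n after \r: the first token "\r" fails, so the loop never runs.
        System.out.println(countTokenMatches(source, "(\\r\\n|\\n)"));  // 0
    }
}
```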

夜访吸血鬼 2024-09-09 12:49:51

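It might be worth mentioning that your example is ambiguous. It could be:

\r
\n
\n
\r
\r
\n
\n

(seven lines)

or:

\r\n
\n
\r
\r\n
\n

(five lines)

The ? quantifier you have used is greedy, which would probably make five the right answer. But because Scanner iterates over tokens (in your case individual characters, due to the delimiter pattern you chose), the pattern is only ever applied to one character at a time, so \r\n can never match as a unit and you arrive at the incorrect answer of seven.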
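
For the original goal of reporting line numbers, one possible approach (a sketch only, not something proposed in the thread) is to run a Matcher over the whole source and count the line terminators that end at or before a given offset. The class LineNumbers and the method lineNumberAt are hypothetical names, and the sketch assumes \r\n, a lone \r and a lone \n each end exactly one line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineNumbers {
    // Greedy alternation: "\r\n" is consumed as one terminator, never as two.
    private static final Pattern NEWLINE = Pattern.compile("\\r\\n?|\\n");

    // Hypothetical helper: 1-based line number of the character at `offset`.
    static int lineNumberAt(String source, int offset) {
        int line = 1;
        Matcher m = NEWLINE.matcher(source);
        // Count only the terminators that end at or before the offset.
        while (m.find() && m.end() <= offset) {
            line++;
        }
        return line;
    }

    public static void main(String[] args) {
        String source = "a\r\nb\nc\rd";
        System.out.println(lineNumberAt(source, source.indexOf('b'))); // 2
        System.out.println(lineNumberAt(source, source.indexOf('d'))); // 4
    }
}
```

Rescanning from the start on every call is quadratic for many lookups; a tokenizer that reports errors often would more likely keep a running line counter instead.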
