Java Scanner newline parsing with regex (bug?)

Posted on 2024-09-02 12:49:51

I'm developing a syntax analyzer by hand in Java, and I'd like to use regexes to parse the various token types. The problem is that I'd also like to be able to accurately report the current line number if the input doesn't conform to the syntax.

Long story short, I've run into a problem when I try to match a newline with the Scanner class. Specifically, matching a newline against a pattern with Scanner fails, almost always. But when I perform the same matching with a Matcher on the same source string, it retrieves the newlines exactly as you'd expect. Is there a reason for this that I can't seem to discover, or is this a bug, as I suspect?

FYI: I was unable to find a bug in the Sun database that describes this issue, so if it is a bug, it hasn't been reported.

Example Code:

    Pattern newLinePattern = Pattern.compile("(\\r\\n?|\\n)", Pattern.MULTILINE);
    String sourceString = "\r\n\n\r\r\n\n";
    Scanner scan = new Scanner(sourceString);
    scan.useDelimiter("");
    int count = 0;
    while (scan.hasNext(newLinePattern)) {
        scan.next(newLinePattern);
        count++;
    }
    System.out.println("found " + count + " newlines"); // finds 7 newlines
    Matcher match = newLinePattern.matcher(sourceString);
    count = 0;
    while (match.find()) {
        count++;
    }
    System.out.println("found " + count + " newlines"); // finds 5 newlines

Comments (4)

绅士风度i 2024-09-09 12:49:51

Your useDelimiter() and next() combination is the problem. useDelimiter("") makes next() return substrings of length 1, because an empty string does in fact sit between every two characters.

That is, since "\r\n".equals("\r" + "" + "\n"), the string "\r\n" is in fact two tokens, "\r" and "\n", delimited by "".
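
One way to see this tokenization directly is to collect whatever next() returns under the empty delimiter. The following is only a minimal sketch: the class name EmptyDelimiterTokens and the escape helper are made up for illustration and are not part of the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class EmptyDelimiterTokens {
    // Collects the tokens Scanner produces when the delimiter is the empty string.
    public static List<String> tokens(String source) {
        List<String> result = new ArrayList<>();
        try (Scanner scan = new Scanner(source)) {
            scan.useDelimiter("");
            while (scan.hasNext()) {
                result.add(scan.next());
            }
        }
        return result;
    }

    // Makes \r and \n visible when printed (hypothetical helper).
    public static String escape(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        for (String token : tokens("\r\n\n\r\r\n\n")) {
            System.out.println("[" + escape(token) + "]");
        }
    }
}
```

Each printed token should be a single character, which is why "\r\n" can never reach the pattern as one unit.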

To get the Matcher-behavior, you need findWithinHorizon, which ignores delimiters.

    Pattern newLinePattern = Pattern.compile("(\\r\\n?|\\n)", Pattern.MULTILINE);
    String sourceString = "\r\n\n\r\r\n\n";
    Scanner scan = new Scanner(sourceString);
    int count = 0;
    while (scan.findWithinHorizon(newLinePattern, 0) != null) {
        count++;
    }
    System.out.println("found "+count+" newlines"); // finds 5 newlines

API links

  • findWithinHorizon(Pattern pattern, int horizon)

    Attempts to find the next occurrence of the specified pattern [...] ignoring delimiters [...] If no such pattern is detected then the null is returned [...] If horizon is 0, then [...] this method continues to search through the input looking for the specified pattern without bound.


寄人书 2024-09-09 12:49:51

That is, in fact, the expected behaviour of both. The scanner primarily cares about splitting things into tokens using your delimiter. So it (lazily) takes your sourceString and sees it as the following set of tokens: \r, \n, \n, \r, \r, \n, and \n. When you then call hasNext it checks if the next token matches your pattern (which they all trivially do thanks to the ? on the \r\n?). The while loop therefore iterates over each of the 7 tokens.

On the other hand, the matcher will match the regex greedily - so it bundles the \r\ns together as you expect.

One way to emphasise the behaviour of Scanner is to change your regexp to (\\r\\n|\\n). This results in a count of 0. This is because the scanner reads the first token as \r (not \r\n), and then notices it doesn't match your pattern, so returns false when you call hasNext.
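
The comparison can be tried side by side. This is a sketch under the same setup as the question (empty delimiter, same source string); the class name ScannerTokenCount and the method countTokens are invented for the example.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

public class ScannerTokenCount {
    // Counts how many consecutive tokens match the pattern when "" is the delimiter.
    public static int countTokens(String source, Pattern pattern) {
        int count = 0;
        try (Scanner scan = new Scanner(source)) {
            scan.useDelimiter("");
            while (scan.hasNext(pattern)) {
                scan.next(pattern);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String source = "\r\n\n\r\r\n\n";
        // Optional \n: a lone "\r" token still matches, so every token is counted.
        System.out.println(countTokens(source, Pattern.compile("(\\r\\n?|\\n)")));
        // Mandatory \n: the first token "\r" does not match, so the loop never runs.
        System.out.println(countTokens(source, Pattern.compile("(\\r\\n|\\n)")));
    }
}
```

According to the figures given in this thread, the first call should report 7 and the second 0.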

(Short version: the scanner tokenises using your delimiter before using your token pattern, the matcher doesn't do any form of tokenising)
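
Since the original goal was reporting line numbers, one option that avoids Scanner's tokenising altogether is to count line terminators with a Matcher up to the offset of interest. This is a rough sketch, not something proposed in the thread: the class LineNumbers, the method lineAt, and the convention that a terminator belongs to the line it ends are all assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineNumbers {
    // Same line-terminator pattern as in the question: \r\n, lone \r, or lone \n.
    private static final Pattern NEWLINE = Pattern.compile("\\r\\n?|\\n");

    // Returns the 1-based line number of the character at `offset` in `source`.
    public static int lineAt(String source, int offset) {
        Matcher m = NEWLINE.matcher(source);
        int line = 1;
        // Count only the terminators that end at or before the offset.
        while (m.find() && m.end() <= offset) {
            line++;
        }
        return line;
    }

    public static void main(String[] args) {
        String source = "a\r\nb\nc\rd";
        System.out.println(lineAt(source, 0)); // 'a'
        System.out.println(lineAt(source, 3)); // 'b'
        System.out.println(lineAt(source, 7)); // 'd'
    }
}
```

Because the Matcher sees the whole string, "\r\n" is consumed as one terminator and counts as a single line break.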

夜访吸血鬼 2024-09-09 12:49:51

It might be worth mentioning that your example is ambiguous. It could be:

\r
\n
\n
\r
\r
\n
\n

(seven lines)

or:

\r\n
\n
\r
\r\n
\n

(five lines)

The ? quantifier you have used is a greedy quantifier, which would probably make five the right answer, but because Scanner iterates over tokens (in your case individual characters, due to the delimiting pattern you chose), it will match reluctantly, one character at a time, arriving at the incorrect answer of seven.
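
To see which of the two readings the greedy pattern picks when no tokenising is involved, the matches from a plain Matcher can be listed. A small sketch follows; the class name GreedyTerminators and its helper methods are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyTerminators {
    // Lists every line terminator a plain Matcher finds, in order.
    public static List<String> terminators(String source) {
        Matcher m = Pattern.compile("\\r\\n?|\\n").matcher(source);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Makes \r and \n visible when printed (hypothetical helper).
    public static String escape(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        for (String t : terminators("\r\n\n\r\r\n\n")) {
            System.out.println(escape(t));
        }
    }
}
```

The greedy ? takes the \n after a \r whenever one is available, so the matches follow the five-line reading: \r\n, \n, \r, \r\n, \n.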

若能看破又如何 2024-09-09 12:49:51

When you use the Scanner with a delimiter of "" it will produce tokens that are each one character long. This is before your new line regex is applied. It then matches each of these characters against the new line regex; each one matches, so it produces 7 tokens. However, because it split the string into 1-character tokens it will not group adjacent \r\n characters into one token.
