Java Scanner newline parsing with regex (bug?)
I'm developing a syntax analyzer by hand in Java, and I'd like to use regexes to parse the various token types. The problem is that I'd also like to be able to accurately report the current line number if the input doesn't conform to the syntax.
Long story short, I've run into a problem when I try to match a newline with the Scanner class. To be specific, when I try to match a newline with a pattern using the Scanner class, it fails. Almost always. But when I perform the same matching using a Matcher and the same source string, it retrieves the newline exactly as you'd expect it to. Is there a reason for this that I can't seem to discover, or is this a bug, as I suspect?
FYI: I was unable to find a bug in the Sun database that describes this issue, so if it is a bug, it hasn't been reported.
Example code:
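import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NewlineCount {
    public static void main(String[] args) {
        Pattern newLinePattern = Pattern.compile("(\\r\\n?|\\n)", Pattern.MULTILINE);
        String sourceString = "\r\n\n\r\r\n\n";

        // Count newlines with a Scanner that uses the empty string as its delimiter.
        Scanner scan = new Scanner(sourceString);
        scan.useDelimiter("");
        int count = 0;
        while (scan.hasNext(newLinePattern)) {
            scan.next(newLinePattern);
            count++;
        }
        System.out.println("found " + count + " newlines"); // finds 7 newlines

        // Count newlines with a Matcher over the same source string.
        Matcher match = newLinePattern.matcher(sourceString);
        count = 0;
        while (match.find()) {
            count++;
        }
        System.out.println("found " + count + " newlines"); // finds 5 newlines
    }
}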
Your useDelimiter() and next() combo is faulty. useDelimiter("") makes next() return 1-length substrings, because an empty string does in fact sit between every two characters.
That is, because "\r\n".equals("\r" + "" + "\n"), the string "\r\n" is in fact two tokens, "\r" and "\n", delimited by "".
To get the Matcher behavior, you need findWithinHorizon, which ignores delimiters.
API links
findWithinHorizon(Pattern pattern, int horizon)
"Attempts to find the next occurrence of the specified pattern [...] ignoring delimiters [...] If no such pattern is detected then the null is returned [...] If horizon is 0, then [...] this method continues to search through the input looking for the specified pattern without bound."
Related questions
useDelimiter("") will tokenize into 1-length substrings
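As a minimal sketch of the findWithinHorizon suggestion, here is the question's source string counted with a Scanner that searches for the pattern directly instead of tokenizing first. The class name HorizonDemo and the helper countNewlines are made up for illustration; the horizon of 0 means "search without bound", as in the API text quoted above. With this approach the count should agree with the Matcher result of five.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

public class HorizonDemo {
    // Hypothetical helper: counts line breaks by searching the input directly,
    // so the scanner's delimiter plays no part in the result.
    static int countNewlines(String source) {
        Pattern newLine = Pattern.compile("\\r\\n?|\\n");
        int count = 0;
        try (Scanner scan = new Scanner(source)) {
            // A horizon of 0 lets the search run to the end of the input.
            while (scan.findWithinHorizon(newLine, 0) != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("found " + countNewlines("\r\n\n\r\r\n\n") + " newlines");
    }
}
```

Since each successful call advances the scanner past the match, the same loop could also carry a running line counter for error reporting.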
That is, in fact, the expected behaviour of both. The scanner primarily cares about splitting things into tokens using your delimiter. So it (lazily) takes your sourceString and sees it as the following set of tokens: \r, \n, \n, \r, \r, \n, and \n. When you then call hasNext it checks if the next token matches your pattern (which they all trivially do thanks to the ? on the \r\n?). The while loop therefore iterates over each of the 7 tokens.
On the other hand, the matcher will match the regex greedily, so it bundles each \r\n pair together as you expect.
One way to emphasise the behaviour of Scanner is to change your regexp to (\\r\\n|\\n). This results in a count of 0. This is because the scanner reads the first token as \r (not \r\n), and then notices it doesn't match your pattern, so it returns false when you call hasNext.
(Short version: the scanner tokenises using your delimiter before using your token pattern; the matcher doesn't do any form of tokenising.)
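To make the tokenise-first behaviour concrete, here is a rough sketch that lists the tokens the scanner sees and then repeats the counting loop with both patterns. TokenDemo, tokens and countWithScanner are illustrative names, not part of the original code. On the question's input the counts should be 7 for the optional-\n pattern and 0 for the stricter one, as described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class TokenDemo {
    // Returns the tokens a Scanner produces when the delimiter is the empty string.
    static List<String> tokens(String source) {
        List<String> out = new ArrayList<>();
        try (Scanner scan = new Scanner(source)) {
            scan.useDelimiter("");
            while (scan.hasNext()) {
                out.add(scan.next());
            }
        }
        return out;
    }

    // Same loop as in the question: stops at the first token that does not match.
    static int countWithScanner(String source, Pattern tokenPattern) {
        int count = 0;
        try (Scanner scan = new Scanner(source)) {
            scan.useDelimiter("");
            while (scan.hasNext(tokenPattern)) {
                scan.next(tokenPattern);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String source = "\r\n\n\r\r\n\n";
        System.out.println("tokens: " + tokens(source).size());
        System.out.println("optional \\n: " + countWithScanner(source, Pattern.compile("\\r\\n?|\\n")));
        System.out.println("required \\n: " + countWithScanner(source, Pattern.compile("\\r\\n|\\n")));
    }
}
```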
It might be worth mentioning that your example is ambiguous. It could be:
\r, \n, \n, \r, \r, \n, \n (seven lines)
or:
\r\n, \n, \r, \r\n, \n (five lines)
The ? quantifier you have used is a greedy quantifier, which would probably make five the right answer, but because Scanner iterates over tokens (in your case individual characters, due to the delimiting pattern you chose), it effectively matches reluctantly, one character at a time, arriving at the incorrect answer of seven.
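For the five-line reading, a small sketch that collects what a plain Matcher actually matches on the question's input. GreedyDemo and lineBreaks are hypothetical names used only here. Because the ? is greedy, each \r takes a following \n with it when one is present.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Collects every line break the pattern finds, in order of appearance.
    static List<String> lineBreaks(String source) {
        Matcher m = Pattern.compile("\\r\\n?|\\n").matcher(source);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        for (String lineBreak : lineBreaks("\r\n\n\r\r\n\n")) {
            // Print the escaped form so the control characters stay visible.
            System.out.println(lineBreak.replace("\r", "\\r").replace("\n", "\\n"));
        }
    }
}
```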
When you use the Scanner with a delimiter of "", it will produce tokens that are each one character long. This happens before your newline regex is applied. It then matches each of these characters against the newline regex; each one matches, so it produces 7 tokens. However, because it split the string into 1-character tokens, it will not group adjacent \r\n characters into one token.