Java: Using a Scanner delimiter as a token

Posted 2024-08-23 05:49:27

I'm trying to find a good way to get a Scanner to use a given delimiter as a token. For example, I'd like to split up a piece of text into digit and non-digit chunks, so ideally I'd just set the delimiter to \D and set some flag like useDelimiterAsToken, but after briefly looking through the API I'm not coming up with anything. Right now I've had to resort to using combined lookaheads/lookbehinds for the delimiter, which is somewhat painful:

scanner.useDelimiter("((?<=\\d)(?=\\D)|(?<=\\D)(?=\\d))");

This looks for any transition from a digit to a non-digit or vice-versa. Is there a more sane way to do this?
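For anyone who wants to try the delimiter above, here is a minimal sketch of how it behaves on a short string. The class name `DigitChunks`, the helper `chunks`, and the sample input are made up for illustration; only the regex comes from the question (with the redundant outer parentheses dropped).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class; not part of the original question.
class DigitChunks {

    // Splits text into alternating digit and non-digit runs by using a
    // zero-width delimiter that matches only at digit/non-digit transitions.
    static List<String> chunks(String text) {
        List<String> out = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");
            while (scanner.hasNext()) {
                out.add(scanner.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample input chosen for illustration only.
        System.out.println(chunks("abc123def45")); // prints [abc, 123, def, 45]
    }
}
```

Because the delimiter consumes no characters, every character of the input ends up in some token, which is the "delimiter as token" effect the question is after.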

寒江雪… 2024-08-30 05:49:27

EDIT: The edited question is so different, my original answer doesn't apply at all. For the record, what you're doing is the ideal way to solve your problem, in my opinion. Your delimiter is the zero-width boundary between a digit and a non-digit, and there's no more succinct way to express that than what you posted.

EDIT2: (In response to the question asked in the comment.) You originally asked for an alternative to this regex:

"((?<=\\w)(?=[^\\w])|(?<=[^\\w])(?=\\w))"

That's almost exactly how \b, the word-boundary construct, works:

"(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)"

That is, a position that's either preceded by a word character and not followed by one, or followed by a word character and not preceded by one. The difference is that \b can match at the beginning and end of the input. You obviously didn't want that, so I added lookarounds to exclude those conditions:

"(?!^)\\b(?!$)"

It's just a more concise way to do what your regex did. But then you changed the requirement to matching digit/non-digit boundaries, and there's no shorthand for that like \b for word/non-word boundaries.
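As a rough check of the word-boundary shorthand, the sketch below splits one ASCII sample string with both the original lookaround regex and the `\b` version and compares the results. It uses `String.split` rather than a Scanner to keep the comparison simple; the class name `WordBoundarySplit`, the constant names, and the sample text are assumptions made for illustration. It only shows agreement on this one input and is not a proof that the two patterns are equivalent in general.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of the original answer.
class WordBoundarySplit {

    // The lookaround form from the original question (outer parentheses dropped).
    static final String LOOKAROUND = "(?<=\\w)(?=[^\\w])|(?<=[^\\w])(?=\\w)";

    // The shorthand from the answer: a word boundary that is not at the
    // start or the end of the input.
    static final String BOUNDARY = "(?!^)\\b(?!$)";

    static String[] split(String regex, String text) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        // ASCII-only sample chosen for illustration.
        String sample = "hello, world 42";
        String[] a = split(LOOKAROUND, sample);
        String[] b = split(BOUNDARY, sample);
        System.out.println(Arrays.toString(a));
        System.out.println(Arrays.toString(b));
        System.out.println(Arrays.equals(a, b)); // prints true
    }
}
```

Both calls yield the five pieces "hello", ", ", "world", " ", and "42" for this input. Swapping `\\w` for `\\d` in the lookaround form gives the digit/non-digit delimiter from the edited question, for which no comparable shorthand exists.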
