Using Scanner's useDelimiter with hyphenated words like "abc-def"
I'm currently trying to process a text file in which some words are joined with a "-", and I want to count the words.
scanner.useDelimiter("[.,:;()?!\" \t\n\r]+");
The problem is simple: once the hyphen is treated as a delimiter, words that contain a "-" get split and counted as two words. So just adding an escaped \- to the character class isn't the solution.
How can I change the delimiter expression so that words like "foo-bar" stay intact, but a "-" standing alone is filtered out and ignored?
Thanks ;)
OK, I'm guessing at your question here: you have a text file with some "real" prose, i.e. sentences that actually make sense and are separated by punctuation and the like, right?
Example:
So, what you need as a delimiter is either any run of whitespace and/or punctuation (which the regex you showed already covers), or a hyphen surrounded by at least one whitespace character on each side. The regex operator for "or" is "|". Many regex implementations, Java's included, also have a shorthand for the whitespace character class (spaces, tabs, and newlines): "\s".
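A minimal sketch of that idea, assuming the punctuation set from the question. The class name `HyphenDelimiterDemo`, the helper `words`, and the sample sentence are made up for illustration. The hyphen alternative is listed first so that it gets a chance to match before the plain whitespace run does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class HyphenDelimiterDemo {
    // Alternative 1: a hyphen with at least one whitespace character on each side.
    // Alternative 2: the punctuation/whitespace run from the question, using \s.
    // Alternative 1 comes first; otherwise the leading whitespace alone would be
    // consumed as a delimiter and the lone "-" would come back as a token.
    static final String DELIMITER = "\\s+-\\s+|[.,:;()?!\"\\s]+";

    // Hypothetical helper: returns every token the Scanner produces for the text.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "foo-bar" stays one token; the free-standing " - " is swallowed.
        System.out.println(words("foo-bar is one word - really, it is!"));
    }
}
```

One limitation of this sketch: a lone hyphen that directly follows punctuation (as in "foo, - bar") is still returned as a token, because the delimiter match starts at the comma and the second alternative stops before the hyphen.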
If possible, try to use the predefined character classes; they make the regex much easier to read. See java.util.regex.Pattern for the options.
Maybe this is what you are looking for:
It reads: match one or more whitespace characters, optionally followed by zero or more non-word characters and a whitespace character.
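As a rough illustration of that reading, here is one pattern that fits the description, `\s+(?:\W*\s)?`. This exact pattern is an assumption reconstructed from the description, not necessarily the one the answer had in mind. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class WhitespaceRunDelimiterDemo {
    // Assumed pattern: one or more whitespace characters, optionally followed by
    // zero or more non-word characters and one more whitespace character.
    static final String DELIMITER = "\\s+(?:\\W*\\s)?";

    // Hypothetical helper: collects the tokens produced with the delimiter above.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Free-standing "-" and "--" are swallowed together with their whitespace.
        System.out.println(tokens("foo-bar - baz  -- qux"));
    }
}
```

Note that a delimiter of this shape always starts at whitespace, so punctuation glued to a word (as in "Hello, world") stays attached to the token. That does not change the word count, but the tokens themselves are not cleaned.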
This is not very simple. One thing to try would be {current-delimiter-chars}{zero-or-more-hyphens}{zero-or-more-current-delimiter-chars-or-hyphen}.
It might be easier to just ignore the words returned by the scanner that consist entirely of hyphens.
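The second suggestion could look roughly like this: keep the delimiter from the question unchanged and skip any token made up only of hyphens. The class name `SkipHyphenTokensDemo` and the method `countWords` are hypothetical names for this sketch.

```java
import java.util.Scanner;

class SkipHyphenTokensDemo {
    // The delimiter exactly as given in the question (no hyphen in it).
    static final String DELIMITER = "[.,:;()?!\" \t\n\r]+";

    // Hypothetical helper: counts tokens, ignoring those made up only of hyphens.
    static int countWords(String text) {
        int count = 0;
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                String token = scanner.next();
                if (!token.matches("-+")) { // skip "-", "--", "---", ...
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("foo-bar - baz -- qux."));
    }
}
```

This keeps the regex simple at the cost of one extra check per token.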
NB
The next(String) method ensures that you get only words, since the original useDelimiter() call lacks the "|".
NB
You have used the regular expression "\r\n|\n" as the line terminator. The JavaDocs for java.util.regex.Pattern list other possible line terminators, so a more complete check would use the expression "\r\n|[\r\n\u2028\u2029\u0085]".
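To show what the broader expression buys you, here is a small sketch that splits text into lines with it. The class name `LineTerminatorDemo` and the method `lines` are illustrative only, and the pattern is written with regex-level escapes, which match the same characters.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class LineTerminatorDemo {
    // "\r\n" is tried first so a Windows line ending counts as one terminator,
    // then any single terminator: CR, LF, U+2028, U+2029 or U+0085 (NEL).
    static final Pattern LINE_END =
            Pattern.compile("\\r\\n|[\\r\\n\\u2028\\u2029\\u0085]");

    // Hypothetical helper: splits the text at every line terminator.
    static String[] lines(String text) {
        return LINE_END.split(text);
    }

    public static void main(String[] args) {
        String text = "one\r\ntwo\nthree\rfour\u2028five";
        System.out.println(Arrays.toString(lines(text)));
    }
}
```

With only "\r\n|\n", the bare "\r" and the U+2028 separator in the sample would not split the text.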
This should be simple enough:
[^\\w-]\\W*|-\\W+
[^\\p{Alnum}-]\\P{Alnum}*|-\\P{Alnum}+
[^\\p{Alpha}-]\\P{Alpha}*|-\\P{Alpha}+
EDIT: These are the simpler forms. Keep in mind that a complete solution, one that also handles dashes at the beginning and end of lines, would follow this pattern:
(?:^|[^\\w-])\\W*|-(?:\\W+|$)
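A quick way to try the first of the simpler forms with a Scanner is sketched below. The class name `HyphenAwareDelimiterDemo`, the helper `words`, and the sample input are assumptions made for illustration. The anchored variant is not exercised here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class HyphenAwareDelimiterDemo {
    // First simple form from the answer: a delimiter either starts with a
    // character that is neither a word character nor a hyphen, or it is a
    // hyphen followed by at least one non-word character.
    static final String DELIMITER = "[^\\w-]\\W*|-\\W+";

    // Hypothetical helper: collects the tokens produced with the delimiter above.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "foo-bar" survives; the lone "-" and the trailing hyphen of "qux-" do not.
        System.out.println(words("foo-bar - baz, qux- end"));
    }
}
```

Because \w also covers digits and the underscore, tokens such as "a_b" or "x-1" are kept whole with this form; the \p{Alpha} variant is stricter.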