Using the Scanner delimiter function for "abc-def"

Posted on 2024-07-17 06:41:31

I'm currently trying to filter a text file whose words are sometimes separated by a "-". I want to count the words.

scanner.useDelimiter("[.,:;()?!\" \t\n\r]+");

The problem is simple: if "-" is added to the delimiter, words that contain a "-" get split and counted as two words. So just adding an escaped \- to the character class isn't the solution.

How can I change the delimiter expression so that words like "foo-bar" stay intact, but a "-" on its own is filtered out and ignored?

Thanks ;)

半世晨晓 2024-07-24 06:41:31

OK, I'm guessing at your question here: you mean that you have a text file with some "real" prose, i.e. sentences that actually make sense, are separated by punctuation and the like, etc., right?

Example:

This situation is ameliorated - as far as we can tell - by the fact that our most trusted allies, the Vorgons, continue to hold their poetry slam contests; the enemy has little incentive to interfere with that, even with their Mute-O-Matic devices.

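So, what you need as a delimiter is either any amount of whitespace and/or punctuation (which you already have covered with the regex you showed), or a hyphen that is surrounded by at least one whitespace character on each side. The regex character for "or" is "|". There is a shortcut for the whitespace character class (spaces, tabs, and newlines) in many regex implementations: "\s". In a Java string literal the backslash has to be doubled, and the hyphen alternative should be tried first, inside a repeated group, so that the whitespace in front of a lone hyphen is not consumed by the punctuation class on its own:

"(?:\\s+-\\s+|[.,:;()?!\"\\s])+"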

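A minimal sketch of how a delimiter of this kind could be used to count words with java.util.Scanner. The class name HyphenAwareWords and the helper words() are hypothetical and exist only for illustration. The delimiter is written as a Java string literal, so every backslash is doubled, and this sketch assumes the lone-hyphen alternative is placed first inside a repeated group, so that a run such as ". - " is consumed as a single delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class HyphenAwareWords {
    // Assumed delimiter: any run of (whitespace, hyphen, whitespace)
    // or single punctuation/whitespace characters.
    static final String DELIMITER = "(?:\\s+-\\s+|[.,:;()?!\"\\s])+";

    // Hypothetical helper: returns the tokens the Scanner produces for the text.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "This situation is ameliorated - as far as we can tell - "
                + "by their Mute-O-Matic devices.";
        List<String> tokens = words(sample);
        System.out.println(tokens.size() + " words: " + tokens);
    }
}
```

With this pattern, "Mute-O-Matic" stays one token because its hyphens have no whitespace around them. A hyphen at the very start or end of the input, or several lone hyphens in a row, would still come back as tokens, so filtering the returned tokens may still be needed for such input.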

扮仙女 2024-07-24 06:41:31

If possible, try to use the predefined character classes; they make the regex much easier to read. See java.util.regex.Pattern for the options.

Maybe this is what you are looking for:

string.split("\\s+(\\W*\\s)?")

Reads: match one or more whitespace characters, optionally followed by zero or more non-word characters and a whitespace character.
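
A small sketch of what this split expression does in practice. The class name SplitOnWhitespace and the helper tokens() are hypothetical, and the trim() call is an addition of this sketch to avoid an empty first element when the text starts with whitespace.

```java
import java.util.Arrays;

class SplitOnWhitespace {
    // Hypothetical helper: splits on whitespace, swallowing a run of
    // non-word characters that stands between two whitespace characters.
    static String[] tokens(String text) {
        return text.trim().split("\\s+(\\W*\\s)?");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("one - two, three-four")));
    }
}
```

For this input the lone hyphen disappears and "three-four" stays together, giving the three tokens "one", "two," and "three-four". Punctuation attached to a word (the comma in "two,") remains part of the token, which does not change the word count but matters if the words themselves are needed.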

策马西风 2024-07-24 06:41:31

This is not very simple. One thing to try is {current-delimiter-chars}{zero-or-more-hyphens}{zero-or-more-current-delimiter-chars-or-hyphen}.

It may be easier to ignore the words returned by the Scanner that consist entirely of hyphens.
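
The second suggestion, dropping tokens that consist only of hyphens, could look roughly like this. It is a sketch under assumptions: the class name SkipLoneHyphens and the helper countWords() are hypothetical, and the delimiter is the one from the question, left unchanged.

```java
import java.util.Scanner;

class SkipLoneHyphens {
    // Hypothetical helper: counts tokens, ignoring those made only of hyphens.
    static int countWords(String text) {
        int count = 0;
        try (Scanner scanner = new Scanner(text)) {
            // Delimiter taken from the question; "-" is deliberately not part of it.
            scanner.useDelimiter("[.,:;()?!\" \t\n\r]+");
            while (scanner.hasNext()) {
                String token = scanner.next();
                // Skip "-", "--", and so on; keep words such as "foo-bar".
                if (!token.matches("-+")) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("one two - three four-five -- six"));
    }
}
```

This keeps the delimiter simple and moves the hyphen handling into ordinary code, at the cost of one extra check per token.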

站稳脚跟 2024-07-24 06:41:31

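import java.util.Scanner;

Scanner scanner = new Scanner("one   two2  -   (three) four-five - ,....|");
// Delimiter: runs of lone hyphens (no word character on either side) and punctuation/whitespace.
scanner.useDelimiter("(\\B-\\B|[.,:;()?!\" \t|])+");

while (scanner.hasNext()) {
    // next(String) throws InputMismatchException if the token is not a (possibly hyphenated) word.
    System.out.println(scanner.next("\\w+(-\\w+)*"));
}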

The next(String) method asserts that you get only words: it throws an InputMismatchException if the next token does not match the given pattern. This is useful because the original useDelimiter() expression misses "|".

Note: you have used the regular expression "\r\n|\n" as the line terminator. The Javadoc for java.util.regex.Pattern lists other possible line terminators, so a more complete check would use the expression "\r\n|[\r\n\u2028\u2029\u0085]".
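
As a rough illustration of the more complete line-terminator expression, the sketch below splits a string into lines with it. The class name LineTerminators and the helper lines() are hypothetical. The backslashes are doubled so that the regex engine, rather than the Java compiler, interprets the escapes.

```java
class LineTerminators {
    // The expression from the note above, written as a Java string literal.
    static final String LINE_BREAK = "\\r\\n|[\\r\\n\\u2028\\u2029\\u0085]";

    // Hypothetical helper: splits text at any of the listed line terminators.
    static String[] lines(String text) {
        return text.split(LINE_BREAK);
    }

    public static void main(String[] args) {
        String text = "first\r\nsecond\nthird\u2028fourth";
        for (String line : lines(text)) {
            System.out.println(line);
        }
    }
}
```

Putting "\r\n" first in the alternation means a Windows line ending counts as one terminator instead of two.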
