Using a Scanner delimiter for "abc-def" words

Posted 2024-07-17 06:41:31

I'm currently trying to filter a text file in which some words are separated by a "-". I want to count the words.

scanner.useDelimiter("[.,:;()?!\" \t\n\r]+");

The problem is simply this: if I add "-" to the delimiter set, words that contain a "-" get split and are counted as two words. So just adding an escaped \- isn't the solution of choice.

How can I change the delimiter expression so that words like "foo-bar" stay intact, but a "-" on its own is filtered out and ignored?

Thanks ;)


Comments (5)

半世晨晓 2024-07-24 06:41:31

OK, I'm guessing at your question here: you mean that you have a text file with some "real" prose, i.e. sentences that actually make sense and are separated by punctuation and the like, right?

Example:

This situation is ameliorated - as far as we can tell - by the fact that our most trusted allies, the Vorgons, continue to hold their poetry slam contests; the enemy has little incentive to interfere with that, even with their Mute-O-Matic devices.

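So, what you need as a delimiter is either any amount of whitespace and/or punctuation (which you already have covered with the regex you showed), or a hyphen that is surrounded by at least one whitespace character on each side. The regex operator for "or" is "|". Many regex implementations have a shortcut for the whitespace character class (spaces, tabs, and newlines): "\s". In a Java string literal the backslash has to be doubled, and because Java tries alternatives from left to right, the hyphen alternative has to come first; otherwise the leading whitespace is consumed by the character class and the lone "-" still comes back as a token.

"\\s+-\\s+|[.,:;()?!\"\\s]+"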
扮仙女 2024-07-24 06:41:31

If possible try to use the pre-defined classes... makes the regex much easier to read. See java.util.regex.Pattern for options.

Maybe this is what you are looking for:

string.split("\\s+(\\W*\\s)?")

Reads: Match 1 or more whitespace chars optionally followed by zero or more non-word characters and a whitespace character.
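A small sketch of how this split expression behaves, assuming a plain String input. The class and method names are hypothetical. Note that the expression only ever starts matching at whitespace, so punctuation attached directly to a word (such as a trailing comma) stays on the token.

```java
import java.util.Arrays;

class WhitespaceSplit {
    // Splits on whitespace, optionally followed by non-word characters
    // and one more whitespace character (e.g. " - ").
    static String[] split(String text) {
        return text.split("\\s+(\\W*\\s)?");
    }

    public static void main(String[] args) {
        // The lone hyphen disappears, "c-d" stays whole, "a," keeps its comma.
        System.out.println(Arrays.toString(split("a, b - c-d")));
    }
}
```

If attached punctuation should also be dropped, the tokens would need a second clean-up pass or a different delimiter.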

策马西风 2024-07-24 06:41:31

This is not very simple. One thing to try would be {current-delimiter-chars}{zero-or-more-hyphens}{zero-or-more-current-delimiter-chars-or-hyphens}.

It might be easier to just ignore words returned by the scanner that consist entirely of hyphens.
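The second suggestion could look roughly like this: keep the delimiter from the question unchanged and skip any token made up only of hyphens. This is a sketch; the class and method names are invented for the example.

```java
import java.util.Scanner;

class HyphenOnlyFilter {
    // Counts tokens, ignoring those that consist entirely of hyphens.
    static int countWords(String text) {
        int count = 0;
        try (Scanner scanner = new Scanner(text)) {
            // Delimiter exactly as in the question.
            scanner.useDelimiter("[.,:;()?!\" \t\n\r]+");
            while (scanner.hasNext()) {
                String token = scanner.next();
                if (!token.matches("-+")) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("foo-bar - baz -- qux."));
    }
}
```

This leaves the regular expression simple, at the cost of one extra check per token.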

站稳脚跟 2024-07-24 06:41:31
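import java.util.Scanner;

Scanner scanner = new Scanner("one   two2  -   (three) four-five - ,....|");
// A delimiter is a run of the listed characters, or a hyphen with no word character on either side.
scanner.useDelimiter("(\\B-\\B|[.,:;()?!\" \t|])+");

while (scanner.hasNext()) {
    // next(String) throws InputMismatchException if the token is not a (hyphenated) word.
    System.out.println(scanner.next("\\w+(-\\w+)*"));
}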

NB

The next(String) method throws an InputMismatchException if the next token does not match the given pattern, so it guarantees that you get only words. This matters because the original useDelimiter() expression does not include "|".

NB

You have used the regular expression "\r\n|\n" as line terminator. The JavaDocs for java.util.regex.Pattern list other possible line terminators, so a more complete check would use the expression "\r\n|[\r\n\u2028\u2029\u0085]".
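One way to turn the snippet in this answer into a word count, sketched under a few assumptions: the names are hypothetical, \r and \n are added to the delimiter set so that multi-line input works, and a single \B is used on each side of the hyphen, which is meant to match the same positions as \B+. Tokens that are not words are skipped rather than raising an exception.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

class HyphenWordCount {
    // Run of listed characters, or a hyphen with no word character next to it.
    private static final Pattern DELIMITER =
            Pattern.compile("(\\B-\\B|[.,:;()?!\" \t\r\n|])+");
    // A word, optionally continued by hyphen-joined parts (e.g. four-five).
    private static final Pattern WORD = Pattern.compile("\\w+(-\\w+)*");

    static int countWords(String text) {
        int count = 0;
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                // Count only tokens that look like words; ignore anything else.
                if (WORD.matcher(scanner.next()).matches()) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("one   two2  -   (three) four-five - ,....|"));
    }
}
```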

清醇 2024-07-24 06:41:31

This should be simple enough: [^\\w-]\\W*|-\\W+

  • But of course, if it's prose and you want to exclude underscores:
    [^\\p{Alnum}-]\\P{Alnum}*|-\\P{Alnum}+
  • Or if you don't expect numerics:
    [^\\p{Alpha}-]\\P{Alpha}*|-\\P{Alpha}+

EDIT: These are the simpler forms. Keep in mind that the complete solution, which would also handle dashes at the beginning and end of lines, would follow this pattern: (?:^|[^\\w-])\\W*|-(?:\\W+|$)
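A short sketch of the first, simpler expression used as a Scanner delimiter. The class and method names are hypothetical, and the sample text is made up. A delimiter here either starts at a character that is neither a word character nor a hyphen and then swallows all following non-word characters, or it is a hyphen followed by at least one non-word character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class NonWordDelimiter {
    // The simpler form from this answer, as a Java string literal.
    static final String DELIMITER = "[^\\w-]\\W*|-\\W+";

    // Collects every token the Scanner yields for the given text.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("foo-bar - baz, (qux) -- a-b-c end."));
    }
}
```

As the EDIT notes, this simpler form does not handle a dash at the very start or end of the input.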
