Java regex: a tricky pattern

Posted on 2024-12-07 15:12:00

I've been stuck for a while on a regex that should do the following:

  • split my sentences with this: "[\W+]"
  • but if it finds a word like this: "aaa-aa" (not "aaa - aa" or "aaa--aaa-aa"), the word isn't split and is kept whole.

    Basically, I want to split a sentence into words, while also treating "aaa-aa" as a single word. I have successfully done that by creating two separate functions, one that splits with \W and another that finds words like "aaa-aa". Finally, I add both results together and remove the parts of each compound word.

    For example, the sentence:

    "Hello my-name is Richard"

    First I collect {Hello, my, name, is, Richard}
    then I collect {my-name}
    then I add {my-name} to {Hello, my, name, is, Richard}
    then I take {my} and {name} out of {Hello, my, name, is, Richard}.
    Result: {Hello, my-name, is, Richard}

    This approach does what I need, but for parsing large files it becomes too heavy, because each sentence needs too many copies. So my question is: is there anything I can do to put everything into one pattern? Something like:

    "Split the text using the pattern "[\W+]", but if you find a word like "aaa-aa", consider it one word and not two."
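For reference, here is a minimal sketch of what the plain split from the question produces. The class and method names are made up for illustration. Note that inside square brackets the + is a literal plus sign rather than a quantifier, so "[\W+]" splits on every single non-word character; the sketch compares it with the quantified form \W+.

```java
import java.util.Arrays;

public class BaselineSplitDemo {
    // The pattern from the question: a character class holding \W and a literal '+'.
    static String[] splitSingle(String s) {
        return s.split("[\\W+]");
    }

    // The quantified form: a run of one or more non-word characters is one separator.
    static String[] splitRuns(String s) {
        return s.split("\\W+");
    }

    public static void main(String[] args) {
        // The hyphenated word is broken apart: [Hello, my, name, is, Richard]
        System.out.println(Arrays.toString(splitSingle("Hello my-name is Richard")));
        // Adjacent separators leave an empty token: [Hello, , my, name]
        System.out.println(Arrays.toString(splitSingle("Hello, my-name")));
        // With the quantifier the empty token disappears: [Hello, my, name]
        System.out.println(Arrays.toString(splitRuns("Hello, my-name")));
    }
}
```

Either way "my-name" is split in two, which is the behavior the question wants to avoid.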


Comments (5)

深居我梦 2024-12-14 15:12:00

If you want to use split() rather than explicitly matching the words you are interested in, the following should do what you want:

[\s-]{2,}|\s

To break that down: the first alternative splits on two or more whitespace characters and/or hyphens. A single '-' does not match, so 'one-two' is left alone, but something like 'one--two', 'one - two' or even 'one - --- - two' is split into 'one' and 'two'. That still leaves the normal case of a single whitespace character ('one two') unmatched, so we add an or ('|') followed by a single whitespace (\s). Note that the order of the alternatives matters: alternatives separated by '|' are tried left to right, so the spaces-and-hyphens alternative has to come first. If we did it the other way around, then given something like 'one -two' we would match on the first whitespace and return 'one', '-two'.
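As a rough sketch of how this pattern behaves in Java (the class name HyphenSplitDemo and the helper splitWords are hypothetical, introduced only for this example):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class HyphenSplitDemo {
    // Runs of two or more whitespace/hyphen characters are tried first,
    // then a single whitespace character.
    static final Pattern SEPARATOR = Pattern.compile("[\\s-]{2,}|\\s");

    static String[] splitWords(String sentence) {
        return SEPARATOR.split(sentence);
    }

    public static void main(String[] args) {
        // [Hello, my-name, is, Richard]
        System.out.println(Arrays.toString(splitWords("Hello my-name is Richard")));
        // [one, two]
        System.out.println(Arrays.toString(splitWords("one--two")));
        // [one, two]
        System.out.println(Arrays.toString(splitWords("one - --- - two")));
        // [one, two]
        System.out.println(Arrays.toString(splitWords("one -two")));
    }
}
```

Only whitespace and hyphens act as separators here, so punctuation stays attached to its word: "Hello, world" comes back as "Hello," and "world".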

If you want to play around with Java regexes interactively, I can thoroughly recommend http://myregexp.com/signedJar.html, which lets you edit the regex and see it matching against a sample string as you edit.

做个ˇ局外人 2024-12-14 15:12:00

Why not use the pattern \\s+? It does exactly what you want without any tricks: it splits the text into words separated by whitespace.
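A small sketch of what that looks like, with made-up class and method names. It also shows two inputs where a whitespace-only split may not be what the question is after:

```java
import java.util.Arrays;

public class WhitespaceSplitDemo {
    // Split on runs of one or more whitespace characters.
    static String[] splitOnWhitespace(String sentence) {
        return sentence.split("\\s+");
    }

    public static void main(String[] args) {
        // [Hello, my-name, is, Richard]
        System.out.println(Arrays.toString(splitOnWhitespace("Hello my-name   is Richard")));
        // A free-standing hyphen becomes its own token: [aaa, -, aa]
        System.out.println(Arrays.toString(splitOnWhitespace("aaa - aa")));
        // Punctuation stays attached to the word: [Hello,, world]
        System.out.println(Arrays.toString(splitOnWhitespace("Hello, world")));
    }
}
```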

倾听心声的旋律 2024-12-14 15:12:00

Your description isn't clear enough, but why not just split it up by spaces?

給妳壹絲溫柔 2024-12-14 15:12:00

I am not sure whether this pattern will work, because I don't have Java developer tools at hand, but you might try it. It uses character class subtraction (written as an intersection with a negated class), which as far as I know is supported only in Java regex:

[\W&&[^-]]+

It means: match characters that are in both [\W] and [^-], that is, non-word characters other than the hyphen.
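One way to check the pattern is a sketch like the following, where the class and method names are invented for the example. It uses the pattern as a split separator:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class IntersectionSplitDemo {
    // Non-word characters, intersected with "anything but a hyphen":
    // every non-word character except '-' acts as a separator.
    static final Pattern SEPARATOR = Pattern.compile("[\\W&&[^-]]+");

    static String[] splitWords(String sentence) {
        return SEPARATOR.split(sentence);
    }

    public static void main(String[] args) {
        // [Hello, my-name, is, Richard]
        System.out.println(Arrays.toString(splitWords("Hello, my-name is Richard")));
        // The hyphen is never a separator, so it survives as a token: [aaa, -, aa]
        System.out.println(Arrays.toString(splitWords("aaa - aa")));
        // ...and a double hyphen is not split at all: [aaa--aa]
        System.out.println(Arrays.toString(splitWords("aaa--aa")));
    }
}
```

So the pattern keeps "my-name" together and also handles punctuation, but inputs like "aaa - aa" and "aaa--aa" would still need extra handling.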

浮萍、无处依 2024-12-14 15:12:00

Almost the same regular expression as in your previous question:

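import java.util.regex.Matcher;
import java.util.regex.Pattern;

String sentence = "Hello my-name is Richard";
// A word, optionally followed by a single hyphen and another word.
// The lookarounds keep a match from starting or ending inside a word.
Pattern pattern = Pattern.compile("(?<!\\w)\\w+(-\\w+)?(?!\\w)");
Matcher matcher = pattern.matcher(sentence);
while (matcher.find()) {
    System.out.println(matcher.group()); // Hello, my-name, is, Richard (one per line)
}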

I just made the hyphenated part optional with (...)? so that it also matches non-hyphenated words.
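Since the question is about avoiding intermediate copies, the same pattern can collect the words straight into a list in a single pass. This is only a sketch; WordExtractor and words are hypothetical names, and the pattern is compiled once and reused.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordExtractor {
    // Compiled once; a word with an optional "-word" suffix.
    static final Pattern WORD = Pattern.compile("(?<!\\w)\\w+(-\\w+)?(?!\\w)");

    // Collects every match in order of appearance.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // [Hello, my-name, is, Richard]
        System.out.println(words("Hello, my-name is Richard."));
        // [aaa, aa]
        System.out.println(words("aaa - aa"));
        // [aaa, aaa-aa]
        System.out.println(words("aaa--aaa-aa"));
    }
}
```

Because this matches words instead of splitting on separators, punctuation and free-standing hyphens never show up as tokens.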
