Java regex: capturing groups with comma-separated values

Posted on 2024-08-22 04:31:12

InputString: A soldier may have bruises , wounds , marks , dislocations or other Injuries that hurt him .

ExpectedOutput:

bruises

wounds

marks

dislocations

Injuries

Generalized Pattern Tried:

    ".[\s]?(\w+?)" +                    // bruises
    "(?:(\s)?,(\s)?(\w+?))*" +          // wounds, marks, dislocations
    "[\s]?(?:or|and) other (\w+).";     // Injuries

The pattern should also be able to match other input strings, such as: A soldier may have bruises or other injuries that hurt him.

On trying the generalized pattern above, the output is:
bruises
dislocations
Injuries

There is something wrong with the capturing group in "(?:(\s)?,(\s)?(\w+?))*". The group matches more than once, but it returns only "dislocations"; "wounds" and "marks" are devoured.

Could you please suggest what should be the right pattern, and where is the mistake?
This question comes closest to this question, but that solution didn't help.

Thanks.

Comments (3)

清风不识月 2024-08-29 04:31:12

When a capture group is followed by a quantifier [i.e. (foo)*], you will only get the last match. If you want to get all of them, you need to put the quantifier inside the capture group and then parse out the values manually. As big a fan as I am of regex, I don't think it's appropriate here for any number of reasons... even if you weren't ultimately doing NLP.
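To make the "last match only" behaviour concrete, here is a minimal sketch. The class and method names are made up for illustration, and the pattern is a simplified stand-in for the one in the question rather than the asker's exact expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows that a capture group inside a quantified
// group keeps only the text from its final iteration.
class RepeatedGroupDemo {

    // Group 1 is the first word; group 2 sits inside a repeated non-capturing group.
    private static final Pattern REPEATED =
            Pattern.compile("(\\w+)(?:\\s*,\\s*(\\w+))*");

    static String lastCaptured(String input) {
        Matcher m = REPEATED.matcher(input);
        // group(2) is overwritten on every iteration of the (?: ... )* loop,
        // and is null if the loop never ran.
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // prints "dislocations": "wounds" and "marks" were matched but not retained
        System.out.println(lastCaptured("bruises , wounds , marks , dislocations"));
    }
}
```

The whole list is still matched; only the per-iteration captures are lost, which is why the values have to be separated in a second step.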

How to fix: (?:(\s)?,(\s)?(\w+?))*

Well, the quantifier basically covers the whole regex in that case and you might as well use Matcher.find() to step through each match. Also, I'm curious why you have capture groups for the whitespace. If all you are trying to do is find a comma-separated set of words then that's something like: \w+(?:\s*,\s*\w+)* Then don't bother with capture groups and just split the whole match.
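As a rough sketch of the Matcher.find() idea: locate the comma-separated run once, then step through it word by word. The class name is hypothetical, and the list pattern here uses `+` instead of `*` on the repeated part (my assumption, not the pattern as written above) so that a lone word such as "A" is not mistaken for a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the first comma-separated run of words and
// walks through it with Matcher.find().
class FindEachWord {

    // Assumption: require at least one comma so single words are skipped.
    private static final Pattern LIST = Pattern.compile("\\w+(?:\\s*,\\s*\\w+)+");
    private static final Pattern WORD = Pattern.compile("\\w+");

    static List<String> items(String input) {
        List<String> result = new ArrayList<>();
        Matcher list = LIST.matcher(input);
        if (list.find()) {
            // Each call to find() advances to the next word inside the matched list.
            Matcher word = WORD.matcher(list.group());
            while (word.find()) {
                result.add(word.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [bruises, wounds, marks, dislocations]
        System.out.println(items(
                "A soldier may have bruises , wounds , marks , dislocations or other Injuries"));
    }
}
```

No capture groups are needed at all in this variant; the whitespace around the commas is simply never part of a `\w+` match.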

And for anything more complicated re: NLP, GATE is a pretty powerful tool. The learning curve is steep at times but you have a whole industry of science-guys to draw from: http://gate.ac.uk/

夏の忆 2024-08-29 04:31:12

Regex is not suited for (natural) language processing. With regex, you can only match well-defined patterns. You should really, really abandon the idea of doing this with regex.

You may want to start a new question where you specify what programming language you're using to perform this task and ask for pointers there.

EDIT

PSpeed posted a promising link to a 3rd party library, Gate, that's able to do many language processing tasks. And it's written in Java. I have not used it myself, but looking at the people/institutions working on it, it seems pretty solid.

终止放荡 2024-08-29 04:31:12

The pattern that works is: \w+(?:\s*,\s*\w+)* and then separate the comma-separated values manually.
There is no other method to do this with Java Regex.
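One possible way to put "match, then separate manually" together for the sentences in the question is sketched below. The class name, the method name, and the way the pattern is extended with the "or other" / "and other" tail are assumptions for illustration; they are not taken from the answers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: captures the whole comma-separated list as one group,
// then splits it manually and appends the word after "or other" / "and other".
class InjuryListExtractor {

    // Group 1: the entire list; group 2: the word following "or other" or "and other".
    private static final Pattern LIST_THEN_OTHER = Pattern.compile(
            "(\\w+(?:\\s*,\\s*\\w+)*)\\s+(?:or|and)\\s+other\\s+(\\w+)");

    static List<String> extract(String input) {
        List<String> items = new ArrayList<>();
        Matcher m = LIST_THEN_OTHER.matcher(input);
        if (m.find()) {
            // The quantifier is inside group 1, so the full list survives;
            // splitting on the comma recovers the individual words.
            items.addAll(Arrays.asList(m.group(1).split("\\s*,\\s*")));
            items.add(m.group(2));
        }
        return items;
    }

    public static void main(String[] args) {
        // prints [bruises, wounds, marks, dislocations, Injuries]
        System.out.println(extract(
                "A soldier may have bruises , wounds , marks , dislocations or other Injuries that hurt him ."));
        // prints [bruises, injuries]
        System.out.println(extract(
                "A soldier may have bruises or other injuries that hurt him."));
    }
}
```

This only works because the sentence follows a fixed shape; as the other answers point out, it is not a substitute for real language processing.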

Ideally, Java regex is not suitable for NLP. A useful tool for text mining is: gate.ac.uk

Thanks to Bart K. and PSpeed.
