Java regex: a tricky pattern
I've been stuck for a while on a regex that should do the following:
- split my sentences with this: "[\W+]"
but if it finds a word like "aaa-aa" (not "aaa - aa" or "aaa--aaa-aa"), that word is not split and is kept whole.
Basically, I want to split a sentence into words, while treating "aaa-aa" as a single word. I have successfully done that by creating two separate functions, one for splitting with \W and another to find words like "aaa-aa". Finally, I add both results together and subtract the parts of each compound word.
For example, the sentence:
"Hello my-name is Richard"
First I collect {Hello, my, name, is, Richard}
then I collect {my-name}
then I add {my-name} to {Hello, my, name, is, Richard}
then I take {my} and {name} out of {Hello, my, name, is, Richard}.
Result: {Hello, my-name, is, Richard}
This approach does what I need, but for parsing large files it becomes too heavy, because each sentence needs too many copies. So my question is: is there anything I can do to include everything in one pattern? Something like:
"Split the text using the pattern "[\W+]", but if you find a word like "aaa-aa", treat it as one word and not two."
Comments (5)
If you want to use a split() rather than explicitly matching the words you are interested in, the following should do what you want:
[\s-]{2,}|\s
To break that down: you first split on two or more whitespace characters and/or hyphens. A single '-' won't match, so 'one-two' is left alone, but something like 'one--two', 'one - two' or even 'one - --- - two' is split into 'one' and 'two'. That still leaves the 'normal' case of a single whitespace ('one two') unmatched, so we add an or ('|') followed by a single whitespace (\s). Note that the order of the alternatives is important: RE subexpressions separated by '|' are tried left to right, so we need to put the spaces-and-hyphens alternative first. If we did it the other way around, then given something like 'one -two' we'd match on the first whitespace and return 'one', '-two'.
If you want to play around with Java REs interactively, I can thoroughly recommend http://myregexp.com/signedJar.html, which lets you edit the RE and see it matching against a sample string as you edit.
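For anyone who wants to try the split pattern from this answer, here is a minimal sketch. The class and method names (HyphenAwareSplit, splitWords) are made up for illustration; only the pattern itself comes from the answer.

```java
import java.util.Arrays;

class HyphenAwareSplit {
    // Separator: two or more whitespace/hyphen characters, OR one whitespace.
    // The longer alternative is listed first so " -" is consumed as a single
    // separator instead of stopping after the space.
    static final String SEPARATOR = "[\\s-]{2,}|\\s";

    // Hypothetical helper: splits a sentence, keeping "aaa-aa" style words whole.
    static String[] splitWords(String sentence) {
        return sentence.split(SEPARATOR);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWords("Hello my-name is Richard")));
        System.out.println(Arrays.toString(splitWords("one - two")));
        System.out.println(Arrays.toString(splitWords("one--two")));
        System.out.println(Arrays.toString(splitWords("one -two")));
    }
}
```

Note that, unlike "[\W+]", this separator only looks at whitespace and hyphens, so punctuation such as commas stays attached to the neighbouring word.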
Why not use the pattern "\\s+"? This does exactly what you want without any tricks: it splits the text into words separated by whitespace.
Your description isn't clear enough, but why not just split it up by spaces?
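To see what plain whitespace splitting gives on the inputs from the question, here is a small sketch. The class and method names (WhitespaceSplit, splitWords) are hypothetical, and the trim() call is an addition of mine to avoid an empty first element on input with leading spaces.

```java
import java.util.Arrays;

class WhitespaceSplit {
    // Hypothetical helper: splits on runs of whitespace only.
    static String[] splitWords(String sentence) {
        return sentence.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // The hyphenated word survives as one token.
        System.out.println(Arrays.toString(splitWords("Hello my-name is Richard")));
        // A hyphen surrounded by spaces becomes a token of its own.
        System.out.println(Arrays.toString(splitWords("aaa - aa")));
    }
}
```

So this keeps "my-name" together, but for "aaa - aa" the lone "-" comes back as a separate token, which may or may not be acceptable given the question.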
I am not sure whether this pattern will work, because I don't have Java developer tools at hand, but you might try it. It uses character class subtraction, which as far as I know is supported only in Java regex:
It means: match characters that are in [\W] and in [^-], that is, characters that are [\W] but not [-].
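The pattern itself is not shown in this answer. As an assumption, a pattern that fits the description ("[\W] and [^-]") can be written with Java's character class intersection operator && as "[\W&&[^-]]+". The sketch below uses that guessed pattern; the class and method names are hypothetical too.

```java
import java.util.Arrays;

class NonWordExceptHyphenSplit {
    // ASSUMED pattern (not taken from the answer): one or more characters
    // that are non-word characters AND are not a hyphen.
    static final String SEPARATOR = "[\\W&&[^-]]+";

    // Hypothetical helper: splits on every non-word character except '-'.
    static String[] splitWords(String sentence) {
        return sentence.split(SEPARATOR);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWords("Hello, my-name is Richard.")));
        System.out.println(Arrays.toString(splitWords("aaa - aa")));
        System.out.println(Arrays.toString(splitWords("aaa--aa")));
    }
}
```

Because a hyphen is never a separator here, "aaa - aa" yields a lone "-" token and "aaa--aa" stays in one piece, so this variant covers the "my-name" case but not the other two cases mentioned in the question.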
Almost the same regular expression as in your previous question:
I just added the optional group
(...)?
to also match non-hyphenated words.
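The regular expression from this answer is not shown here, so the following is only a sketch of the matching approach it describes: find the words directly instead of splitting, with an optional group for the hyphenated part. The pattern "\w+(?:-\w+)?" and the names WordMatcher and findWords are my assumptions, not the answerer's.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatcher {
    // ASSUMED pattern: a run of word characters, optionally followed by
    // exactly one hyphen and another run of word characters.
    static final Pattern WORD = Pattern.compile("\\w+(?:-\\w+)?");

    // Hypothetical helper: collects every match in order of appearance.
    static List<String> findWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("Hello my-name is Richard"));
        System.out.println(findWords("aaa - aa"));
    }
}
```

Matching in a single pass avoids building the intermediate collections described in the question, since each word is emitted once and nothing has to be added and then removed afterwards.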