Regex to ignore underscores
I have a regex, `([-@.\/,':\w]*[\w])*`, that matches all words within a text (including punctuated words like I.B.M), but I want it to exclude underscores and I can't figure out how. I tried adding `^[_]` (e.g. `(^[_][-@.\/,':\w]*[\w])*`), but that just breaks all the words up into letters. I want to preserve the word matching, but I don't want to match words with underscores in them, nor words made up entirely of underscores.
What's the proper way to do this?
P.S.
- My app is written in C# (if that makes any difference).
- I can't use A-Za-z0-9 because I have to match words regardless of the language (could be Chinese, Russian, Japanese, German, English).
Update
Here is an example:
"I.B.M should be parsed as one word w_o_r_d! Russian should work too: мплекс исторических событий."
The matches should be:
I.B.M
should
be
parsed
as
one
word
Russian
should
work
too
мплекс
исторических
событий
Note that w_o_r_d should not get matched.
Comments (3)
Try this instead:
The `\w` class is composed of `[\p{L}\p{Mn}\p{Nd}\p{Pc}]` when you're performing Unicode matching (or simply `[a-zA-Z_0-9]` if you're doing non-Unicode, ECMAScript-style matching). It's the `\p{Pc}` Unicode category (punctuation, connector) that causes the problem by matching underscores, so we explicitly match against the other categories without including that one. (Further information: "Character Classes: Word Character" and "Character Classes: Supported Unicode General Categories".)
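The idea above can be sketched in C# roughly as follows. This is not the answer's original pattern, only a minimal illustration: the class name `WordMatcher`, the method `FindWords`, and the lookaround guards are assumptions added here so that a token touching an underscore (such as w_o_r_d) is skipped entirely rather than split into letters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordMatcher
{
    // Letters and decimal digits only: \w minus \p{Pc} (and minus \p{Mn}),
    // so the underscore no longer counts as a word character.
    private const string WordChar = @"[\p{L}\p{Nd}]";

    // Punctuation allowed inside a word, taken from the question's regex.
    private const string Inner = @"[-@./,':]";

    // Assumption: a token that touches an underscore is rejected as a whole.
    // The lookbehind and lookahead refuse any match that is glued to a \w
    // character (which still includes '_'), directly or via inner punctuation.
    private static readonly Regex Word = new Regex(
        @"(?<!\w" + Inner + @"*)" +
        WordChar + @"(?:[-@./,':\p{L}\p{Nd}]*" + WordChar + @")?" +
        @"(?!" + Inner + @"*\w)");

    // Returns the matched words in order of appearance.
    public static List<string> FindWords(string text)
    {
        return Word.Matches(text).Cast<Match>().Select(m => m.Value).ToList();
    }

    public static void Main()
    {
        // The underscore is a connector punctuation character, hence part of \w.
        Console.WriteLine(Regex.IsMatch("_", @"\p{Pc}"));         // True
        Console.WriteLine(Regex.IsMatch("_", @"[\p{L}\p{Nd}]"));  // False

        const string sample =
            "I.B.M should be parsed as one word w_o_r_d! " +
            "Russian should work too: мплекс исторических событий.";
        foreach (string word in FindWords(sample))
        {
            Console.WriteLine(word);
        }
    }
}
```

On the sample sentence from the question this should list I.B.M and the other thirteen words while leaving out w_o_r_d. Without the two lookarounds, the character classes alone would still match `w`, `o`, `r` and `d` as separate one-letter words.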
The underscore comes from `\w`. Simply use `A-Za-z0-9` instead.
For a more concise version of LukeH's regex, you can use simply:
I simply used `\p{L}` instead of `Lu`, `Ll`, `Lt`, `Lo`, `Lm`. See Supported Unicode General Categories.
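As a quick check of that equivalence, the sketch below compares the spelled-out subcategories with the general category `\p{L}` on a mixed-script string. The class name `LetterCategoryDemo`, the helper `Runs`, and the sample text are made up for illustration. The patterns only cover the letter part of a word, not the full regex discussed above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LetterCategoryDemo
{
    // The five letter subcategories spelled out one by one.
    public const string Verbose = @"[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}]+";

    // The general category L, which is the union of those five.
    public const string Concise = @"\p{L}+";

    // Returns every run of characters matched by the given pattern.
    public static string[] Runs(string pattern, string text)
    {
        return Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        const string sample = "Word w_o_r_d 日本語 исторических 42";
        Console.WriteLine(string.Join(" | ", Runs(Verbose, sample)));
        Console.WriteLine(string.Join(" | ", Runs(Concise, sample)));
    }
}
```

Both lines of output should be identical. Note that the letter runs break at each underscore, so this class on its own does not reject w_o_r_d as a whole.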