Regex to ignore underscores

Posted on 2024-10-27 15:46:39

I have a regex ([-@.\/,':\w]*[\w])* that matches all words within a text (including punctuated words like I.B.M), but I want it to exclude underscores and I can't figure out how. I tried adding ^[_] (e.g. (^[_][-@.\/,':\w]*[\w])*), but that just breaks all the words up into letters. I want to preserve the word matching, but I don't want to match words that contain underscores or words made up entirely of underscores.

What's the proper way to do this?

P.S.

My app is written in C# (if that makes any difference).
I can't use A-Za-z0-9 because I have to match words regardless of the language (could be Chinese, Russian, Japanese, German, English).

Update
Here is an example:

"I.B.M should be parsed as one word w_o_r_d! Russian should work too: мплекс исторических событий."

The matches should be:

I.B.M  
should  
be  
parsed  
as  
one  
word  
Russian  
should  
work  
too  
мплекс  
исторических  
событий

Note that w_o_r_d should not get matched.

想挽留 2024-11-03 15:46:39

Try this instead:

([-@.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])*

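In .NET, the \w class is equivalent to [\p{L}\p{Mn}\p{Nd}\p{Pc}] when you're performing Unicode matching, which is the default. (With RegexOptions.ECMAScript it is simply [a-zA-Z_0-9].) If your text can contain combining marks, you may want to add \p{Mn} to both character classes in the pattern above as well.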

It's the \p{Pc} Unicode category (Punctuation, Connector) that causes the problem by matching underscores, so we explicitly match the other categories and leave that one out.
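A quick way to see which class accepts the underscore is to test single characters against each one. This is only a minimal sketch: the class name WordClassDemo and the helper IsOnly are made up for illustration, and only the standard System.Text.RegularExpressions API is used.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordClassDemo
{
    // Hypothetical helper: true when the whole string is matched by the given class.
    public static bool IsOnly(string s, string charClass) =>
        Regex.IsMatch(s, "^(?:" + charClass + ")$");

    public static void Main()
    {
        foreach (var ch in new[] { "a", "я", "7", "_" })
        {
            bool word = IsOnly(ch, @"\w");                      // \w includes \p{Pc}
            bool letterOrDigit = IsOnly(ch, @"[\p{L}\p{Nd}]");  // no \p{Pc} here
            bool connector = IsOnly(ch, @"\p{Pc}");             // connector punctuation
            Console.WriteLine($"{ch}: \\w={word} L/Nd={letterOrDigit} Pc={connector}");
        }
    }
}
```

The underscore should be the only one of the four characters that \w accepts but [\p{L}\p{Nd}] rejects, which is why swapping the classes keeps underscores out of the matches.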

(For further information, see "Character Classes: Word Character" and "Character Classes: Supported Unicode General Categories" in the .NET regular expression documentation.)
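One caveat: with the pattern above, underscores never appear inside a match, but the letters around them can still match on their own, so w_o_r_d is likely to come back as the fragments w, o, r and d. The sketch below runs that pattern on the sample sentence from the question and then shows one possible alternative, which is not part of the original answer: match tokens with the question's \w-based pattern, then drop any token that contains a \p{Pc} character. The names WordExtractor, WithAnswerPattern and SkippingUnderscoreWords are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordExtractor
{
    // Pattern from the answer: letters and decimal digits only, no \p{Pc}.
    private static readonly Regex NoConnector =
        new Regex(@"([-@.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])*");

    // Inner token from the question's pattern (one token per match, no outer *).
    private static readonly Regex Token = new Regex(@"[-@.\/,':\w]*\w");

    // Any connector punctuation, such as the underscore.
    private static readonly Regex Connector = new Regex(@"\p{Pc}");

    // The outer * also produces empty matches, so those are filtered out.
    public static List<string> WithAnswerPattern(string text) =>
        NoConnector.Matches(text).Cast<Match>()
                   .Select(m => m.Value)
                   .Where(v => v.Length > 0)
                   .ToList();

    // Assumed alternative: keep whole tokens, discard those containing \p{Pc}.
    public static List<string> SkippingUnderscoreWords(string text) =>
        Token.Matches(text).Cast<Match>()
             .Select(m => m.Value)
             .Where(v => !Connector.IsMatch(v))
             .ToList();

    public static void Main()
    {
        const string sample = "I.B.M should be parsed as one word w_o_r_d! " +
                              "Russian should work too: мплекс исторических событий.";
        Console.WriteLine(string.Join(" | ", WithAnswerPattern(sample)));
        Console.WriteLine(string.Join(" | ", SkippingUnderscoreWords(sample)));
    }
}
```

Filtering after matching keeps the regex simple and also discards tokens such as foo.bar_baz as a whole. A lookaround-based pattern could do the same in a single pass, at the cost of readability.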
