UTF-8 word boundary regex in JavaScript
In JavaScript:
"ab abc cab ab ab".replace(/\bab\b/g, "AB");
correctly gives me:
"AB abc cab AB AB"
When I use utf-8 characters though:
"αβ αβγ γαβ αβ αβ".replace(/\bαβ\b/g, "AB");
the word boundary operator doesn't seem to work:
"αβ αβγ γαβ αβ αβ"
Is there a solution to this?
Comments (5)
The word boundary assertion matches only where a word character is not preceded or not followed by another word character (so .\b. is equivalent to \W\w or \w\W). And \w is defined as [A-Za-z0-9_], so \w doesn't match Greek characters. Thus you cannot use \b in this case. What you could do instead is use this:
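A minimal sketch of one such workaround, assuming words are separated only by whitespace: match the start of the string or a whitespace character before the word, and use a lookahead for whitespace or the end of the string after it. Punctuation next to a word would not count as a boundary here.

```javascript
// Use whitespace (or string start/end) as the delimiter instead of \b.
// (^|\s) captures the leading delimiter so it can be put back via $1;
// (?=\s|$) only peeks at the trailing delimiter, leaving it for the next match.
const input = "αβ αβγ γαβ αβ αβ";
const result = input.replace(/(^|\s)αβ(?=\s|$)/g, "$1AB");

console.log(result); // "AB αβγ γαβ AB AB"
```

The lookahead matters: if the trailing whitespace were consumed, two adjacent occurrences separated by a single space could not both match.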
Not all JavaScript regexp implementations support Unicode, so you need to escape the characters.
For mapping the characters you can take a look at http://htmlhelp.com/reference/html40/entities/symbols.html
Of course, this doesn't help with the word boundary issue (as explained in other answers), but it should at least enable you to match the characters properly.
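As a rough illustration of the escaping idea, the Greek letters can be written as \uXXXX escapes in the pattern. The code points used here (U+03B1 for α, U+03B2 for β) are standard Unicode values; the variable names are just for the example.

```javascript
// \u03b1 is α and \u03b2 is β, so the pattern source itself stays pure ASCII.
const escapedPattern = /\u03b1\u03b2/g;
const replaced = "αβ αβγ γαβ αβ αβ".replace(escapedPattern, "AB");

// Every occurrence is replaced, not only whole words,
// because no boundary check is applied.
console.log(replaced); // "AB ABγ γAB AB AB"
```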
I needed something to be programmable and handle punctuation, brackets, etc.
http://jsfiddle.net/AQvyd/
I've written a JavaScript resource editor, which is how I found this page. I answered out of necessity, since I couldn't find a parameterized word-boundary regexp that worked well for Unicode.
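For readers who can't open the fiddle, here is a sketch of what a parameterized whole-word replacement could look like. This is not the fiddle's code: the helper names escapeRegExp and replaceWholeWord are hypothetical, and it assumes an engine with Unicode property escapes and lookbehind (ES2018), treating letters, digits and underscore as word characters.

```javascript
// Hypothetical helpers for illustration; not taken from the linked fiddle.

// Escape regex syntax characters so the search word is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace `word` only where it is not adjacent to another word character.
// Assumption: "word character" means any Unicode letter, number, or underscore.
function replaceWholeWord(input, word, replacement) {
  const wordChar = "[\\p{L}\\p{N}_]";
  const pattern = new RegExp(
    `(?<!${wordChar})${escapeRegExp(word)}(?!${wordChar})`,
    "gu"
  );
  // A function replacer keeps "$" in `replacement` from being interpreted.
  return input.replace(pattern, () => replacement);
}

console.log(replaceWholeWord("αβ αβγ γαβ (αβ), αβ", "αβ", "AB"));
// "AB αβγ γαβ (AB), AB"
```

Because the boundaries are negative lookarounds rather than whitespace checks, punctuation and brackets next to the word count as boundaries.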
Not all of the RegEx implementations associated with JavaScript engines are Unicode-aware.
For example, Microsoft's JScript, used in IE, is limited to ANSI.
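If engine support is a concern, one option is to feature-detect before relying on Unicode-aware patterns. The sketch below is only an assumption about how such a check could be done; supportsUnicodePropertyEscapes is a made-up name.

```javascript
// Hypothetical feature check: returns true if this engine accepts the "u" flag
// together with Unicode property escapes such as \p{L}.
function supportsUnicodePropertyEscapes() {
  try {
    // Built from a string so that older engines fail at runtime here
    // instead of failing to parse the whole script.
    new RegExp("\\p{L}", "u");
    return true;
  } catch (err) {
    return false;
  }
}

console.log(supportsUnicodePropertyEscapes());
```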
When you're dealing with Unicode and natural-language words, you probably want to be more careful with boundaries than just using \b. See this answer for details and directions.
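One way to be more careful, sketched here as an assumption rather than as what the linked answer recommends, is to let the runtime's word segmentation decide where words begin and end. This relies on Intl.Segmenter being available (recent browsers and Node.js); replaceWordSegments and the default locale "el" are choices made for this example only.

```javascript
// Hypothetical helper: replace segments that the segmenter reports as
// word-like and that equal `word` exactly.
function replaceWordSegments(input, word, replacement, locale = "el") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let output = "";
  for (const { segment, isWordLike } of segmenter.segment(input)) {
    // Whitespace and punctuation segments are copied through unchanged.
    output += isWordLike && segment === word ? replacement : segment;
  }
  return output;
}

console.log(replaceWordSegments("αβ αβγ γαβ αβ αβ", "αβ", "AB"));
```

Unlike a regex with hand-written boundary classes, this defers to the Unicode text segmentation rules implemented by the engine.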