Why does \w match only English words in JavaScript regex?
I'm trying to find URLs in some text, using JavaScript code. The problem is that the regular expression I'm using relies on \w to match letters and digits inside the URL, but \w doesn't match non-English characters (in my case, Hebrew letters).
So what can I use instead of \w to match all letters in all languages?
I've just found XRegExp, which has not been mentioned yet, and I'm quite impressed with it. It is an alternative regular expression implementation, has a Unicode plugin, and is licensed under the MIT license.
According to the website, to match Unicode characters, you'd use code like this:
Try \p{L}, the Unicode regex property for letters.
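For illustration, here is a minimal sketch of how \p{L} can be used natively. It assumes an engine that implements ES2018 Unicode property escapes, which only work when the `u` flag is set; older engines reject or misread the pattern. The sample string and variable names are made up for the example.

```javascript
// Assumption: ES2018+ engine. \p{...} is only recognized with the `u` flag.
const letters = /\p{L}+/gu;

// "\u05E9\u05DC\u05D5\u05DD" is the Hebrew word "shalom", written with
// escapes so the source stays plain ASCII.
const sample = "shalom \u05E9\u05DC\u05D5\u05DD 123";

// Runs of letters in any script; the digits are skipped.
const words = sample.match(letters);
console.log(words.length); // 2

// A Unicode-aware stand-in for \w: any letter, any number, or underscore.
const wordChar = /[\p{L}\p{N}_]/u;
console.log(/\w/.test("\u05E9"));     // false
console.log(wordChar.test("\u05E9")); // true
```

Whether `[\p{L}\p{N}_]` is the right replacement for \w depends on what should count as a word character; combining marks (`\p{M}`), for instance, are left out of this sketch.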
Perhaps \S (non-whitespace).
Have a look at http://www.regular-expressions.info/refunicode.html.
It looks like there is no \w equivalent for Unicode, but you can match single Unicode letters, so you can build one yourself.
Check out this SO question about JavaScript and Unicode. Jan Goyvaerts' answer there looks like it offers some hope.
Edit: But then it seems \p isn't supported across all browsers anyway. That question should still contain useful info.
If you're the one generating URLs with non-English letters in them, you may want to reconsider.
If I'm interpreting the W3C correctly, URLs may only contain word characters from the Latin alphabet.
Note that URIs (a superset of URLs) are specified by W3C to allow only US-ASCII characters.
Normally, all other characters should be represented in percent notation.
This is generally what happens when you open a URL with non-ASCII characters in a browser: they get translated into %AB notation, which is itself US-ASCII.
If you can influence how the material is created, the best option is to run URLs through a urlencode()-type function when they are created.
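As a rough sketch of the percent notation described above, JavaScript's built-in `encodeURIComponent` can play the role of a urlencode()-type function. It encodes the text as UTF-8 and writes each byte as %XX. The host `example.com` and the variable names are placeholders invented for the example.

```javascript
// Hebrew "shalom", written with escapes so the source stays plain ASCII.
const segment = "\u05E9\u05DC\u05D5\u05DD";

// Each Hebrew letter becomes two UTF-8 bytes, each written as %XX.
const encoded = encodeURIComponent(segment);
console.log(encoded); // %D7%A9%D7%9C%D7%95%D7%9D

// Once encoded, the segment is pure ASCII, so \w (plus "%") covers it.
console.log(/^[\w%]+$/.test(encoded)); // true

// Hypothetical URL built from the encoded segment.
const url = "http://example.com/" + encoded;

// Decoding restores the original text.
console.log(decodeURIComponent(encoded) === segment); // true
```

If the URLs in the text are already percent-encoded like this, an ASCII-only pattern is enough to find them; the Unicode problem only arises for URLs written with raw non-ASCII characters.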
Because \w only matches the ASCII characters 48-57 ('0'-'9'), 65-90 ('A'-'Z'), 95 ('_') and 97-122 ('a'-'z'). Hebrew characters and other non-English characters (for example, o-umlaut or n-tilde) are outside of those ranges.
Instead of matching foreign-language characters (there are so many of them, in many different Unicode ranges), you might be better off looking for the characters that delineate your words: spaces, quotation marks, and other punctuation.
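The delimiter-based approach suggested above could look something like the following sketch. The delimiter set is a guess made for the example, not a definitive list, and would need adjusting for real input. For URLs in particular, characters such as `.` and `:` would have to be left out of it.

```javascript
// Hypothetical delimiter set: whitespace, quotes, and common punctuation.
const delimiters = /[\s"'.,;:!?()]+/;

// Mixed English and Hebrew text ("\u05E9\u05DC\u05D5\u05DD" is "shalom").
const text = 'He said "\u05E9\u05DC\u05D5\u05DD", then left.';

// Split on delimiters instead of matching word characters, so letters
// from any script survive; filter(Boolean) drops empty strings.
const tokens = text.split(delimiters).filter(Boolean);
console.log(tokens.length); // 5

// For comparison, \w+ finds only the ASCII words.
const asciiOnly = text.match(/\w+/g);
console.log(asciiOnly); // [ 'He', 'said', 'then', 'left' ]
```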
The ECMA 262 v3 standard, which defines the programming language commonly known as JavaScript, stipulates that \w should be equivalent to [a-zA-Z0-9_] and that \d should be equivalent to [0-9]. \s, on the other hand, matches both ASCII and Unicode whitespace, according to the standard.
As of that edition, JavaScript does not support the \p syntax for matching Unicode things either, so there isn't a good way to do this. You could match all Hebrew characters with a character class that covers the Hebrew block; this simply matches any code point in that block.
You can match any ASCII word character or any Hebrew character with a class that combines \w with that range.
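The two patterns described above can be sketched as follows, assuming the Hebrew block is the Unicode range U+0590 to U+05FF. The variable names and the sample string are invented for the example. These patterns use only `\uXXXX` escapes, so they do not depend on \p support.

```javascript
// Any single code point in the Hebrew block (U+0590 to U+05FF).
const hebrewChar = /[\u0590-\u05FF]/;

// Any ASCII word character or any Hebrew character, one or more times.
const wordOrHebrew = /[\w\u0590-\u05FF]+/g;

// "\u05E9\u05DC\u05D5\u05DD" is the Hebrew word "shalom".
const input = "abc \u05E9\u05DC\u05D5\u05DD_1 !";

console.log(hebrewChar.test("\u05E9")); // true
console.log(hebrewChar.test("a"));      // false

// Two matches: "abc" and the Hebrew word followed by "_1".
const matches = input.match(wordOrHebrew);
console.log(matches.length); // 2
```

This covers Hebrew only; each additional script would need its own range added to the class.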
I think you are looking for this regex: