Unicode property escapes - JavaScript
Unicode property escapes in regular expressions allow matching characters based on their Unicode properties. A character is described by several properties, which are either binary ("boolean-like") or non-binary. For instance, Unicode property escapes can be used to match emoji, punctuation, letters (even letters from specific languages or scripts), etc.
Note: For Unicode property escapes to work, a regular expression must use the u flag, which indicates that a string must be treated as a series of Unicode code points. See also RegExp.prototype.unicode.
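To make the role of the u flag concrete, here is a minimal sketch comparing the same pattern with and without it. The variable names are illustrative only, and the "without the flag" behavior assumes an engine that implements the web-compatibility grammar (as browsers and Node.js do), where \p is read as an escaped letter p.

```javascript
// With the u flag, \p{L} is a property escape that matches any letter.
const withFlag = /\p{L}/u;
// Without it, \p is just an escaped "p", so the pattern looks for the literal text "p{L}".
const withoutFlag = /\p{L}/;

const results = {
  unicodeMatchesLetter: withFlag.test("é"),       // true
  legacyMatchesLetter: withoutFlag.test("é"),     // false
  legacyMatchesLiteral: withoutFlag.test("p{L}"), // true
  unicodeFlagSet: withFlag.unicode,               // true
};
console.log(results);
```

Forgetting the flag therefore does not throw an error; the pattern silently matches something else, which is worth checking for when a property escape seems to match nothing.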
Note: Some Unicode properties encompass many more characters than some character classes (such as \w, which matches only the ASCII letters a to z and A to Z, the digits 0 to 9, and the underscore), but the latter are better supported among browsers (as of January 2020).
Syntax
// Non-binary values
\p{UnicodePropertyValue}
\p{UnicodePropertyName=UnicodePropertyValue}
// Binary properties
\p{UnicodeBinaryPropertyName}
// Negation: \P is negated \p
\P{UnicodePropertyValue}
\P{UnicodeBinaryPropertyName}
- UnicodeBinaryPropertyName
  - The name of a binary property. E.g.: ASCII, Alpha, Math, Diacritic, Emoji, Hex_Digit, White_Space, etc. See Unicode Data PropList.txt for more info.
- UnicodePropertyName
  - The name of a non-binary property:
    - General_Category (gc)
    - Script (sc)
    - Script_Extensions (scx)
  - See also PropertyValueAliases.txt
- UnicodePropertyValue
  - One of the values defined for the given property. Many values have aliases or shorthand (e.g. the value Decimal_Number for the General_Category property may be written Nd, digit, or Decimal_Number). For General_Category values, the UnicodePropertyName part and equals sign may be omitted. If a UnicodePropertyName is specified, the value must correspond to the property type given.
Note: As there are many properties and values available, we will not describe them exhaustively here but rather provide various examples.
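The syntax forms described above can be sketched side by side as follows. The sample string and variable names are chosen for illustration only; "\u0663" is the Arabic-Indic digit three, used here as a decimal digit outside ASCII.

```javascript
// "\u0663" is ARABIC-INDIC DIGIT THREE, a decimal digit outside ASCII.
const input = "A1b2\u0663";

// Non-binary value with its property name spelled out.
const decimalLong = input.match(/\p{General_Category=Decimal_Number}/gu);
// Same value, using the short alias and omitting the property name.
const decimalShort = input.match(/\p{Nd}/gu);
// Negation: every character that is NOT a decimal digit.
const notDecimal = input.match(/\P{Nd}/gu);
// Binary property: hexadecimal digits (0-9, A-F, a-f and their fullwidth forms).
const hexDigits = input.match(/\p{Hex_Digit}/gu);

console.log(decimalLong);  // ["1", "2", "\u0663"]
console.log(decimalShort); // ["1", "2", "\u0663"]
console.log(notDecimal);   // ["A", "b"]
console.log(hexDigits);    // ["A", "1", "b", "2"]
```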
Rationale
Before ES2018, there was no performance-efficient way in JavaScript to match characters from different sets based on scripts (like Cyrillic, Greek, Georgian, etc.) or on binary properties (like Emoji). Check out the TC39 proposal on Unicode property escapes for more info.
Examples
General categories
General categories are used to classify Unicode characters, and subcategories are available to define a more precise categorization. Unicode property escapes accept either the short or the long form of a category name.
They can be used to match letters, numbers, symbols, punctuation, spaces, etc. For a more exhaustive list of general categories, please refer to the Unicode specification.
// finding all the letters of a text
let story = "It’s the Cheshire Cat: now I shall have somebody to talk to.";
// Most explicit form
story.match(/\p{General_Category=Letter}/gu);
// It is not mandatory to use the property name for General categories
story.match(/\p{Letter}/gu);
// This is equivalent (short alias):
story.match(/\p{L}/gu);
// This is also equivalent (alternation of all the subcategories using short aliases)
story.match(/\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}/gu);
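As a further illustration of general categories, the sketch below tallies how many characters of a string fall into a few top-level categories. The helper name countByCategory and the choice of categories are assumptions made for this example, not part of any API.

```javascript
// Hypothetical helper: counts characters per top-level general category.
function countByCategory(text) {
  const categories = {
    letters: /\p{L}/gu,     // Letter
    numbers: /\p{N}/gu,     // Number
    punctuation: /\p{P}/gu, // Punctuation
    separators: /\p{Z}/gu,  // Separator (includes the space character)
  };
  const counts = {};
  for (const [name, pattern] of Object.entries(categories)) {
    // match() returns null when nothing matches, so fall back to an empty array.
    counts[name] = (text.match(pattern) || []).length;
  }
  return counts;
}

console.log(countByCategory("Hello, world! 42"));
// { letters: 10, numbers: 2, punctuation: 2, separators: 2 }
```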
Scripts and script extensions
Some languages use different scripts for their writing system. For instance, English and Spanish are written using the Latin script, while Arabic and Russian are written with other scripts (Arabic and Cyrillic, respectively). The Script and Script_Extensions Unicode properties allow regular expressions to match characters according to the script they are mainly used with (Script) or according to the set of scripts they belong to (Script_Extensions).
For example, A belongs to the Latin script and ε to the Greek script.
let mixedCharacters = "aεЛ";
// Using the canonical "long" name of the script
mixedCharacters.match(/\p{Script=Latin}/u); // a
// Using a short alias for the script
mixedCharacters.match(/\p{Script=Grek}/u); // ε
// Using the short name sc for the Script property
mixedCharacters.match(/\p{sc=Cyrillic}/u); // Л
For more details, please refer to the Unicode specification and the Scripts table in the ECMAScript specification.
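Building on the Script property, the following sketch reports which scripts from a small candidate list appear in a string. Both the function name detectScripts and the candidate list are hypothetical and chosen for illustration; a real application would pick its own list of script names.

```javascript
// Hypothetical list of scripts to probe; extend it as needed.
const CANDIDATE_SCRIPTS = ["Latin", "Greek", "Cyrillic", "Arabic"];

// Returns the candidate scripts that have at least one character in `text`.
function detectScripts(text, scripts = CANDIDATE_SCRIPTS) {
  return scripts.filter((name) =>
    // The backslash is doubled because the pattern is built from a string.
    new RegExp(`\\p{Script=${name}}`, "u").test(text)
  );
}

console.log(detectScripts("aεЛ"));   // ["Latin", "Greek", "Cyrillic"]
console.log(detectScripts("hello")); // ["Latin"]
```

Note that an unknown script name makes the RegExp constructor throw a SyntaxError, so the candidate names must be valid Script values.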
If a character is used in a limited set of scripts, the Script property will only match its "predominant" script. To match characters based on a "non-predominant" script, we can use the Script_Extensions property (scx for short).
// ٢ is the digit 2 in Arabic-Indic notation.
// It is predominantly written within the Arabic script,
// but it can also be written in the Thaana script.
"٢".match(/\p{Script=Thaana}/u);
// null, as Thaana is not the predominant script
"٢".match(/\p{Script_Extensions=Thaana}/u);
// ["٢", index: 0, input: "٢", groups: undefined]
Unicode property escapes vs. character classes
With JavaScript regular expressions, it is also possible to use character classes, especially \w or \d, to match letters or digits. However, such forms only match ASCII characters (in other words, a to z, A to Z, 0 to 9, and _ for \w, and 0 to 9 for \d). As shown in this example, they can be clumsy to work with for non-Latin text.
Unicode property escape categories encompass many more characters, and \p{Letter} or \p{Number} will work for any script.
// Trying to use ranges to avoid \w limitations:
const nonEnglishText = "Приключения Алисы в Стране чудес";
const regexpBMPWord = /([\u0000-\u001F\u0021-\uFFFF])+/gu;
// BMP goes through U+0000 to U+FFFF but space is U+0020
console.table(nonEnglishText.match(regexpBMPWord));
// Using Unicode property escapes instead
const regexpUPE = /\p{L}+/gu;
console.table(nonEnglishText.match(regexpUPE));
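The contrast can also be checked directly, as in this small sketch. The sample strings and variable names are illustrative; the character class [\p{L}\p{N}_] is only a rough Unicode-aware counterpart of \w, not an exact equivalent.

```javascript
const title = "Приключения Алисы в Стране чудес";

// \w only covers ASCII word characters, so it finds nothing in Cyrillic text.
const asciiWords = title.match(/\w+/g);
// \p{L} covers letters of any script.
const unicodeWords = title.match(/\p{L}+/gu);

// A rough Unicode-aware counterpart of \w+: letters, numbers, and underscore.
// "\u0664\u0662" is the number 42 written with Arabic-Indic digits.
const wordLike = "año_2024 \u0664\u0662".match(/[\p{L}\p{N}_]+/gu);

// \d only matches ASCII digits, while \p{Nd} matches decimal digits of any script.
const asciiDigit = /\d/.test("\u0664");
const unicodeDigit = /\p{Nd}/u.test("\u0664");

console.log(asciiWords);          // null
console.log(unicodeWords.length); // 5
console.log(wordLike);            // ["año_2024", "\u0664\u0662"]
console.log(asciiDigit, unicodeDigit); // false true
```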
Specifications
Specification |
---|
ECMAScript (ECMA-262) The definition of 'RegExp: Unicode property escapes' in that specification. |
Browser compatibility
For browser compatibility information, check out the main Regular Expressions compatibility table.