Tokenizing a string with regular expressions in JavaScript

Posted on 2024-12-20 20:12:56

Suppose I have a long string containing newlines and tabs:

var x = "This is a long string.\n\t This is another one on next line.";

How can I split this string into tokens using a regular expression?

I don't want to use .split(' ') because I want to learn JavaScript's regular expressions.

A more complicated string could be this:

var y = "This @is a #long $string. Alright, lets split this.";

Now I want to extract only the valid words from this string, without special characters or punctuation, i.e., I want these:

var xwords = ["This", "is", "a", "long", "string", "This", "is", "another", "one", "on", "next", "line"];

var ywords = ["This", "is", "a", "long", "string", "Alright", "lets", "split", "this"];

Comments (6)

影子是时光的心 2024-12-27 20:12:56

Here is a jsfiddle example of what you asked for: http://jsfiddle.net/ayezutov/BjXw5/1/

Basically, the code is very simple:

var y = "This @is a #long $string. Alright, lets split this.";
var regex = /[^\s]+/g; // One or more non-whitespace characters; the g flag finds every match, not just the first

var match = y.match(regex);
for (var i = 0; i<match.length; i++)
{
    document.write(match[i]);
    document.write('<br>');
}

UPDATE:
You can extend the list of separator characters excluded by the character class: http://jsfiddle.net/ayezutov/BjXw5/2/

var regex = /[^\s\.,!?]+/g;

UPDATE 2:
To match only word characters (\w covers ASCII letters, digits, and the underscore):
http://jsfiddle.net/ayezutov/BjXw5/3/

var regex = /\w+/g;
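
To see how the three patterns above differ, the following sketch runs each one against the string y from the question. The variable names are illustrative and not part of the original answer, and console.log stands in for document.write.

```javascript
const y = "This @is a #long $string. Alright, lets split this.";

// Runs of non-whitespace: symbols and punctuation stay attached to the words.
const nonSpace = y.match(/[^\s]+/g);

// Also excludes . , ! ? so trailing punctuation is dropped, but @ # $ remain.
const noPunct = y.match(/[^\s.,!?]+/g);

// Runs of word characters only (ASCII letters, digits, underscore).
const wordChars = y.match(/\w+/g);

console.log(nonSpace);
console.log(noPunct);
console.log(wordChars);
```

Only the last pattern produces the ywords list from the question. Note that match returns null rather than an empty array when nothing matches, so a guard such as `|| []` may be needed for arbitrary input.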
不念旧人 2024-12-27 20:12:56

Use \s+ as the split pattern to tokenize the string on runs of whitespace.
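
As a minimal sketch of that suggestion, assuming the string x from the question (the name whitespaceTokens is made up for illustration):

```javascript
const x = "This is a long string.\n\t This is another one on next line.";

// \s+ treats any run of spaces, tabs, and newlines as a single separator.
const whitespaceTokens = x.split(/\s+/);

console.log(whitespaceTokens);
```

This handles the newline and tab, but punctuation stays attached to the tokens ("string." and "line."), so it does not by itself give the xwords list. A string with leading or trailing whitespace would also yield empty strings at the ends of the array.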

桃气十足 2024-12-27 20:12:56

Calling exec in a loop steps through the matches, skipping non-word (\W) characters and capturing each word.

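var A = [], words,
    str = "This @is a #long $string. Alright, let's split this.",
    // Skip leading non-word characters, capture a word (letters and apostrophes),
    // then consume the separator that follows or the end of the string
    rx = /\W*([a-zA-Z][a-zA-Z']*)(\W+|$)/g;

// With the g flag, each exec call resumes at rx.lastIndex and returns null when no match is left
while ((words = rx.exec(str)) != null) {
    A.push(words[1]); // words[1] is the captured word
}
A.join(', ')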

/*  returned value: (String)
This, is, a, long, string, Alright, let's, split, this
*/
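
The same loop can be written with String.prototype.matchAll (ES2020), which avoids managing lastIndex by hand. This is an alternative sketch, not the answer's own code; the simplified pattern and the name wordList are assumptions made for illustration.

```javascript
const str = "This @is a #long $string. Alright, let's split this.";

// A word is a letter followed by any number of letters or apostrophes.
const wordPattern = /[a-zA-Z][a-zA-Z']*/g;

// matchAll requires the g flag and yields one match object per word.
const wordList = Array.from(str.matchAll(wordPattern), (m) => m[0]);

console.log(wordList.join(", "));
```

Because the pattern no longer consumes the surrounding separators, no capture group is needed and each match is the word itself.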
调妓 2024-12-27 20:12:56

Here is a solution that uses regex capture groups to split the text into different types of tokens.

You can test the code here https://jsfiddle.net/u3mvca6q/5/

/*
Basic Regex explanation:
/                   Regex start
(\w+)               First group, words     \w matches an ASCII letter, digit, or underscore     + means 1 or more of them
|                   or
(,|!)               Second group, punctuation
|                   or
(\s)                Third group, white spaces
/                   Regex end
g                   "global", enables looping over the string to capture one element at a time

Regex result:
result[0] : default group : any match
result[1] : group1 : words
result[2] : group2 : punctuation , !
result[3] : group3 : whitespace
*/
var basicRegex = /(\w+)|(,|!)|(\s)/g;

/*
Advanced Regex explanation:
[a-zA-Z\u0080-\u00FF] instead of \w     Adds code points U+0080 to U+00FF (which include accented Latin-1 letters such as é and ç) to the ASCII letters. Find more Unicode ranges here https://apps.timwhitlock.info/js/regex

(\.\.\.|\.|,|!|\?)                      Identify ellipses (...) and periods as separate tokens

You can improve it by adding ranges for other punctuation and so on
*/
var advancedRegex = /([a-zA-Z\u0080-\u00FF]+)|(\.\.\.|\.|,|!|\?)|(\s)/g;

var basicString = "Hello, this is a random message!";
var advancedString = "Et en français ? Avec des caractères spéciaux ... With one point at the end.";

console.log("------------------");
var result = null;
do {
    result = basicRegex.exec(basicString)
    console.log(result);
} while(result != null)

console.log("------------------");
var result = null;
do {
    result = advancedRegex.exec(advancedString)
    console.log(result);
} while(result != null)

/*
Output of the first loop (basicRegex on basicString):
Array [ "Hello",        "Hello",        undefined,  undefined ]
Array [ ",",            undefined,      ",",        undefined ]
Array [ " ",            undefined,      undefined,  " "       ]
Array [ "this",         "this",         undefined,  undefined ]
Array [ " ",            undefined,      undefined,  " "       ]
Array [ "is",           "is",           undefined,  undefined ]
Array [ " ",            undefined,      undefined,  " "       ]
Array [ "a",            "a",            undefined,  undefined ]
Array [ " ",            undefined,      undefined,  " "       ]
Array [ "random",       "random",       undefined,  undefined ]
Array [ " ",            undefined,      undefined,  " "       ]
Array [ "message",      "message",      undefined,  undefined ]
Array [ "!",            undefined,      "!",        undefined ]
null
*/
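
A variation on the same idea is to use named capture groups (ES2018) so that each token carries its type instead of relying on the group index. The function tokenize and the group names word, punct, and space are hypothetical and chosen for this sketch only.

```javascript
// Hypothetical helper: returns an array of { type, value } tokens.
function tokenize(text) {
  // Same three alternatives as basicRegex, with a name for each group.
  const pattern = /(?<word>\w+)|(?<punct>[,!])|(?<space>\s)/g;
  const tokens = [];
  for (const m of text.matchAll(pattern)) {
    // Exactly one named group is defined per match; its name is the token type.
    const type = Object.keys(m.groups).find((name) => m.groups[name] !== undefined);
    tokens.push({ type, value: m[0] });
  }
  return tokens;
}

const typedTokens = tokenize("Hello, this is a random message!");
console.log(typedTokens);

// Keep only the words, as the question asks.
const onlyWords = typedTokens.filter((t) => t.type === "word").map((t) => t.value);
console.log(onlyWords);
```

Characters that match none of the alternatives are skipped silently, which is also how the original loop behaves.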
陪你到最终 2024-12-27 20:12:56
var words = y.split(/[^A-Za-z0-9]+/);
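
One caveat with splitting on non-alphanumeric runs: when the string begins or ends with a separator, split returns an empty string at that end. The sketch below, using the string y from the question, shows the effect and one possible way to drop the empty entries; the variable names are illustrative.

```javascript
const y = "This @is a #long $string. Alright, lets split this.";

// The trailing "." is a separator, so the result ends with an empty string.
const rawParts = y.split(/[^A-Za-z0-9]+/);

// filter(Boolean) removes empty strings (an empty string is falsy).
const cleanParts = rawParts.filter(Boolean);

console.log(rawParts);
console.log(cleanParts);
```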
唐婉 2024-12-27 20:12:56

To extract only word characters, use the \w symbol. Whether \w also matches Unicode characters is implementation-dependent, and you can use this reference to check the behavior of your language or library. In JavaScript, \w matches only the ASCII set [A-Za-z0-9_].

Please see Alexander Yezutov's answer (update 2) for how to apply this in an expression.
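
Since \w is ASCII-only in JavaScript, one option for text with accented or non-Latin letters is a Unicode property escape, available with the u flag since ES2018. The sketch below is not from the answers above; it assumes precomposed (NFC) input and reuses the French sample string from another answer.

```javascript
const sample = "Et en français ? Avec des caractères spéciaux ... With one point at the end.";

// \p{L} matches any Unicode letter; the u flag is required for \p{...}.
// normalize("NFC") composes letter + combining accent pairs so they match as one letter.
const unicodeWords = sample.normalize("NFC").match(/\p{L}+/gu) || [];

// \w+ splits accented words apart because ç and è are not ASCII word characters.
const asciiWords = sample.match(/\w+/g) || [];

console.log(unicodeWords);
console.log(asciiWords);
```

With \p{L}+ the word "français" stays intact, while \w+ breaks it into "fran" and "ais".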
