Tokenizing a string with regular expressions in Javascript
Suppose I have a long string containing newlines and tabs:
var x = "This is a long string.\n\t This is another one on next line.";
So how can we split this string into tokens using a regular expression?
I don't want to use .split(' ')
because I want to learn Javascript's Regex.
A more complicated string could be this:
var y = "This @is a #long $string. Alright, lets split this.";
Now I want to extract only the valid words from this string, without special characters or punctuation, i.e. I want these:
var xwords = ["This", "is", "a", "long", "string", "This", "is", "another", "one", "on", "next", "line"];
var ywords = ["This", "is", "a", "long", "string", "Alright", "lets", "split", "this"];
Comments (6)
Here is a jsfiddle example of what you asked: http://jsfiddle.net/ayezutov/BjXw5/1/
Basically, the code is very simple:
UPDATE:
Basically you can expand the list of separator characters: http://jsfiddle.net/ayezutov/BjXw5/2/
UPDATE 2:
Only letters all the time:
http://jsfiddle.net/ayezutov/BjXw5/3/
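The fiddle code is not reproduced on this page, so the following is only a minimal sketch of the "only letters" idea, not the code behind the links. It assumes that a "word" means a run of ASCII letters; the helper name extractWords is hypothetical.

```javascript
// Sketch only: extractWords is a hypothetical helper, not the code from the linked fiddles.
function extractWords(text) {
  // With the g flag, match() returns every run of ASCII letters,
  // or null when there is no match at all, hence the fallback to [].
  return text.match(/[A-Za-z]+/g) || [];
}

var x = "This is a long string.\n\t This is another one on next line.";
var y = "This @is a #long $string. Alright, lets split this.";

var xwords = extractWords(x);
var ywords = extractWords(y);

console.log(xwords);
console.log(ywords);
```

Matching the words directly avoids having to list every separator character, which is what a split-based approach would require.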
Use \s+ to tokenize the string.
exec can loop through the matches to remove non-word (\W) characters.
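Neither of the two suggestions above came with code, so here is a rough sketch of how they might look. The function names are hypothetical. The first function splits on \s+ and then strips \W characters from each token; the second loops with exec over \w+ matches instead. Note that \w also accepts digits and underscores, which may or may not count as "valid words" for the question.

```javascript
// Hypothetical helper: split on runs of whitespace (spaces, tabs, newlines),
// then remove non-word characters from each token.
function splitAndStrip(text) {
  return text
    .trim()
    .split(/\s+/)
    .map(function (token) { return token.replace(/\W/g, ""); })
    .filter(function (token) { return token.length > 0; });
}

// Hypothetical helper: collect every \w+ match with a RegExp.exec loop.
function wordsViaExec(text) {
  var pattern = /\w+/g; // the g flag makes exec resume after the previous match
  var words = [];
  var match;
  while ((match = pattern.exec(text)) !== null) {
    words.push(match[0]);
  }
  return words;
}

var y = "This @is a #long $string. Alright, lets split this.";
console.log(splitAndStrip(y));
console.log(wordsViaExec(y));
```

The two functions agree on the sample strings from the question, but they can differ on other input: "don't" becomes "dont" with the first and "don", "t" with the second.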
Here is a solution using regex groups to tokenise the text using different types of tokens.
You can test the code here https://jsfiddle.net/u3mvca6q/5/
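The code from the fiddle is not included on this page. As an illustration of the general idea only, the sketch below uses one capture group per token type and checks which group matched. The token types (word, number, whitespace, symbol) and the function name tokenize are assumptions made for this example, not necessarily what the linked fiddle uses.

```javascript
// Hypothetical token types; the linked fiddle may define different ones.
function tokenize(text) {
  // One capture group per token type, tried left to right at each position.
  var pattern = /([A-Za-z]+)|(\d+)|(\s+)|([^A-Za-z\d\s])/g;
  var typeNames = ["word", "number", "whitespace", "symbol"];
  var tokens = [];
  var match;
  while ((match = pattern.exec(text)) !== null) {
    // Exactly one group is defined for each match; its index gives the type.
    for (var i = 0; i < typeNames.length; i++) {
      if (match[i + 1] !== undefined) {
        tokens.push({ type: typeNames[i], value: match[i + 1] });
        break;
      }
    }
  }
  return tokens;
}

var y = "This @is a #long $string. Alright, lets split this.";
var words = tokenize(y)
  .filter(function (token) { return token.type === "word"; })
  .map(function (token) { return token.value; });
console.log(words);
```

Keeping the other token types around costs little, and filtering by type afterwards gives the word list the question asks for.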
To extract only word characters, use the \w symbol. Whether it matches Unicode characters is implementation-dependent, and you can use this reference to see what the case is for your language/library. Please see Alexander Yezutov's answer (update 2) on how to apply this in an expression.
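For JavaScript specifically, \w is equivalent to [A-Za-z0-9_], so accented letters break a word apart. The sketch below shows the difference, assuming an engine that supports Unicode property escapes (the u flag with \p{L}, added in ES2018). The sample string is made up for illustration.

```javascript
// Made-up sample: "naïve café_42", written with escapes so the encoding is unambiguous.
var sample = "na\u00EFve caf\u00E9_42";

// \w covers only ASCII letters, digits and underscore in JavaScript.
var asciiWords = sample.match(/\w+/g);

// \p{L} matches any Unicode letter; it requires the u flag.
var unicodeWords = sample.match(/\p{L}+/gu);

console.log(asciiWords);   // [ 'na', 've', 'caf', '_42' ]
console.log(unicodeWords); // [ 'naïve', 'café' ]
```

If the input is known to be plain English text, [A-Za-z]+ is usually enough and also excludes the digits and underscores that \w would let through.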