How to split a JavaScript string by spaces and punctuation?

Posted on 2024-11-10 06:14:46, 184 characters, 1 view, 0 comments

I have some random string, for example: "Hello, my name is john." I want that string split into an array like this: Hello, ,, , my, name, is, john, ., so that punctuation marks are kept as separate elements. I tried str.split(/[^\w\s]|_/g), but it does not seem to work. Any ideas?

Comments (4)

沙与沫 2024-11-17 06:14:47

To split str on any run of non-word characters, i.e. anything other than A-Z, a-z, 0-9, and underscore:

var words=str.split(/\W+/);  // yields empty strings at the ends if str begins or ends with non-word characters
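To see what this one-liner returns for the string from the question, here is a small sketch. The variable names are chosen for illustration only. Note that the punctuation is discarded rather than kept, and that the trailing period leaves an empty string at the end of the array.

```javascript
// Illustrative names only: `sample`, `words` and `nonEmpty` are not from the answer.
const sample = 'Hello, my name is john.';

// Split on runs of non-word characters; the separators themselves are dropped.
const words = sample.split(/\W+/);
console.log(words); // [ 'Hello', 'my', 'name', 'is', 'john', '' ]

// Drop the empty string left behind by the trailing period.
const nonEmpty = words.filter(Boolean);
console.log(nonEmpty); // [ 'Hello', 'my', 'name', 'is', 'john' ]
```

Since the question asks for the punctuation to be kept, this variant is only a fit when the separators are not needed.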

Or, assuming your target language is English, you can extract all semantically useful values from a string (i.e. "tokenizing" a string) using:

var str='Here\'s a (good, bad, indifferent, ...) '+
        'example sentence to be used in this test '+
        'of English language "token-extraction".',

    punct='\\['+ '\\!'+ '\\"'+ '\\#'+ '\\$'+   // since javascript does not
          '\\%'+ '\\&'+ '\\\''+ '\\('+ '\\)'+  // support POSIX character
          '\\*'+ '\\+'+ '\\,'+ '\\\\'+ '\\-'+  // classes, we'll need our
          '\\.'+ '\\/'+ '\\:'+ '\\;'+ '\\<'+   // own version of [:punct:]
          '\\='+ '\\>'+ '\\?'+ '\\@'+ '\\['+
          '\\]'+ '\\^'+ '\\_'+ '\\`'+ '\\{'+
          '\\|'+ '\\}'+ '\\~'+ '\\]',

    re=new RegExp(           // tokenizer
       '\\s*'+               // discard possible leading whitespace
       '('+                  // start capture group
         '\\.{3}'+           // ellipsis (must appear before punct)
       '|'+                  // alternator
         '\\w+\\-\\w+'+      // hyphenated words (must appear before punct)
       '|'+                  // alternator
         '\\w+\'(?:\\w+)?'+  // compound words (must appear before punct)
       '|'+                  // alternator
         '\\w+'+             // other words
       '|'+                  // alternator
         '['+punct+']'+      // punct
       ')'                   // end capture group
     );

// grep(ary[,filt]) - filters an array
//   note: could use jQuery.grep() instead
// @param {Array}    ary    array of members to filter
// @param {Function} filt   function to test truthiness of member,
//   if omitted, "function(member){ if(member) return member; }" is assumed
// @returns {Array}  all members of ary where result of filter is truthy
function grep(ary,filt) {
  var result=[];
  for(var i=0,len=ary.length;i<len;i++) {   // visit every index from 0 to len-1
    var member=ary[i]||'';
    if(filt && (typeof filt === 'function') ? filt(member) : member) {   // typeof yields lowercase 'function'
      result.push(member);
    }
  }
  return result;
}

var tokens=grep( str.split(re) );   // note: filter function omitted
                                    //       since all we need to test
                                    //       for is truthiness

which produces:

tokens=[
  'Here\'s',
  'a',
  '(',
  'good',
  ',',
  'bad',
  ',',
  'indifferent',
  ',',
  '...',
  ')',
  'example',
  'sentence',
  'to',
  'be',
  'used',
  'in',
  'this',
  'test',
  'of',
  'English',
  'language',
  '"',
  'token-extraction',
  '"',
  '.'
]

EDIT

Also available as a GitHub Gist
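As a point of comparison, the same token list can probably be obtained more compactly with String.prototype.match and a global regular expression. This is a different technique from the split-and-filter approach in the answer, not a restatement of it. The function name tokenize and the pattern below are assumptions made for this sketch, and [^\w\s] is broader than the answer's ASCII-only punctuation class because it also matches non-ASCII symbols.

```javascript
// Hypothetical helper for this sketch; not part of the original answer.
function tokenize(text) {
  // Alternatives are tried left to right, so the multi-character
  // patterns (ellipsis, hyphenated and apostrophe words) come first.
  const tokenRe = /\.{3}|\w+-\w+|\w+'\w*|\w+|[^\w\s]/g;
  // match() with a global regex returns all matches, or null if there are none.
  return text.match(tokenRe) || [];
}

const sentence = "Here's a (good, bad, indifferent, ...) " +
  'example sentence to be used in this test ' +
  'of English language "token-extraction".';

console.log(tokenize(sentence));
console.log(tokenize('Hello, my name is john.'));
// [ 'Hello', ',', 'my', 'name', 'is', 'john', '.' ]
```

Because match() collects the tokens directly, no empty strings appear and no filtering step is needed.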

本宫微胖 2024-11-17 06:14:47

Try this (I'm not sure if this is what you wanted):

str.replace(/[^\w\s]|_/g, function ($1) { return ' ' + $1 + ' ';}).replace(/[ ]+/g, ' ').split(' ');

http://jsfiddle.net/zNHJW/3/
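A sketch of how this chain behaves on the string from the question, with names picked for illustration: each punctuation character is padded with spaces, runs of spaces are collapsed, and the result is split on single spaces. Because the trailing period is padded too, the plain version ends with an empty string; adding trim() before the split is one possible way to avoid that, and is my addition rather than part of the answer.

```javascript
// Illustrative names only; the regexes are the ones from the answer.
const input = 'Hello, my name is john.';

const parts = input
  .replace(/[^\w\s]|_/g, (ch) => ' ' + ch + ' ') // pad each punctuation mark
  .replace(/[ ]+/g, ' ')                         // collapse runs of spaces
  .split(' ');
console.log(parts); // [ 'Hello', ',', 'my', 'name', 'is', 'john', '.', '' ]

// Assumed tweak: trim before splitting to drop the trailing empty string.
const trimmedParts = input
  .replace(/[^\w\s]|_/g, (ch) => ' ' + ch + ' ')
  .replace(/[ ]+/g, ' ')
  .trim()
  .split(' ');
console.log(trimmedParts); // [ 'Hello', ',', 'my', 'name', 'is', 'john', '.' ]
```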

2024-11-17 06:14:47

Try:

str.split(/([_\W])/)

This will split by any non-alphanumeric character (\W) and any underscore. It uses capturing parentheses to include the item that was split by in the final result.
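To illustrate the effect of the capturing parentheses, here is a short sketch using the string from the question (variable names are mine). Each separator shows up in the result, and empty strings appear wherever two separators are adjacent or the string ends with one; filtering them out is an optional extra step that the answer does not mention.

```javascript
// Illustrative names only; the regex is the one from the answer.
const input = 'Hello, my name is john.';

// The capture group makes split() include each separator in the output.
const pieces = input.split(/([_\W])/);
console.log(pieces);
// [ 'Hello', ',', '', ' ', 'my', ' ', 'name', ' ', 'is', ' ', 'john', '.', '' ]

// Optional: remove the empty strings while keeping the single-space elements.
const withoutEmpty = pieces.filter((piece) => piece !== '');
console.log(withoutEmpty);
// [ 'Hello', ',', ' ', 'my', ' ', 'name', ' ', 'is', ' ', 'john', '.' ]
```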

鲜肉鲜肉永远不皱 2024-11-17 06:14:47

This solution caused a challenge with spaces for me (still needed them), then I gave str.split(/\b/) a shot and all is well. Spaces are output in the array, which won't be hard to ignore, and the ones left after punctuation can be trimmed out.
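A brief sketch of the word-boundary split described here, again on the question's string and with illustrative names. Splitting on \b keeps whitespace and punctuation in the array, with a comma and the space after it arriving as one element; the trimming step shown afterwards is one way to do the cleanup the comment alludes to, and is an assumption on my part.

```javascript
// Illustrative names only; /\b/ is the regex from the comment.
const input = 'Hello, my name is john.';

// Split at every word boundary; nothing is discarded.
const chunks = input.split(/\b/);
console.log(chunks);
// [ 'Hello', ', ', 'my', ' ', 'name', ' ', 'is', ' ', 'john', '.' ]

// Assumed cleanup: trim each element, then drop the whitespace-only ones.
const trimmedChunks = chunks
  .map((chunk) => chunk.trim())
  .filter((chunk) => chunk !== '');
console.log(trimmedChunks);
// [ 'Hello', ',', 'my', 'name', 'is', 'john', '.' ]
```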
