Ruby regular expression to extract words

Posted on 2024-12-15 18:32:30

I'm trying to come up with a regex that splits a string into words, where a word is defined as either a sequence of characters surrounded by whitespace or a sequence of characters enclosed in double quotes. I'm using String#scan.

For instance, the string:

'   hello "my name" is    "Tom"'

should match the words:

hello
my name
is
Tom

I managed to match the words enclosed in double quotes by using:

/"([^\"]*)"/

but I can't figure out how to also handle the whitespace-delimited case to get 'hello', 'is', and 'Tom' without breaking up 'my name'.

Any help with this would be appreciated!


梦里人 2024-12-22 18:32:30

result = '   hello "my name" is    "Tom"'.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)

will work for you. It returns

=> ["", "hello", "\"my name\"", "is", "\"Tom\""]

Just ignore the empty string at the start; it comes from the leading whitespace.

Explanation

\s+           # Match one or more whitespace characters (greedy)
(?=           # Positive lookahead: assert that the following can be matched from this position
   (?:           # Non-capturing group:
      [^"]*         # any number of characters that are not a double quote
      "             # a literal double quote
      [^"]*         # any number of characters that are not a double quote
      "             # a literal double quote
   )*            # repeated zero or more times, so quotes are consumed in pairs
   [^"]*         # any number of characters that are not a double quote
   $             # end of the line (end of the string or just before a line break)
)

In other words, the string is split only on whitespace that is followed by an even number of double quotes, which means whitespace outside a quoted section.

You can use reject to drop the empty strings:

result = '   hello "my name" is    "Tom"'
            .split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/).reject {|s| s.empty?}

which returns

=> ["hello", "\"my name\"", "is", "\"Tom\""]
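Note that the split approach keeps the surrounding double quotes on the quoted words, whereas the question asks for my name and Tom without them. A minimal sketch of one way to strip them afterwards; the method name split_words is hypothetical, and the sketch assumes quotes are balanced and never escaped:

```ruby
# Hypothetical helper: split on whitespace outside double quotes,
# drop empty strings, then strip one leading and one trailing quote.
def split_words(text)
  text.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)
      .reject(&:empty?)
      .map { |word| word.gsub(/\A"|"\z/, '') }
end

p split_words('   hello "my name" is    "Tom"')
# => ["hello", "my name", "is", "Tom"]
```

The gsub call only touches a quote at the very start or very end of each piece, so quotes in the middle of a word are left alone.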


猫弦 2024-12-22 18:32:30

text = '   hello "my name" is    "Tom"'

text.scan(/\s*("([^"]+)"|\w+)\s*/).each {|match| puts match[1] || match[0]}

Produces:

hello
my name
is
Tom

Explanation: the pattern matches zero or more whitespace characters, followed by either some text within double quotes or a single word, followed by zero or more whitespace characters. Because the pattern contains groups, scan yields an array of captures for each match: match[0] is the whole word (including any quotes) and match[1] is the text inside the quotes, or nil for an unquoted word.
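The scan approach can also be written so that it returns an array instead of printing. The following is a sketch, not the answerer's code: the method name extract_words is hypothetical, and it swaps \w+ for \S+ on the assumption that unquoted words may contain punctuation such as apostrophes or hyphens.

```ruby
# Hypothetical helper: collect quoted phrases (without their quotes)
# and bare whitespace-delimited words into one flat array.
def extract_words(text)
  # Group 1 captures the inside of a quoted phrase,
  # group 2 captures a run of non-whitespace characters.
  text.scan(/"([^"]*)"|(\S+)/).map { |quoted, bare| quoted || bare }
end

p extract_words('   hello "my name" is    "Tom"')
# => ["hello", "my name", "is", "Tom"]
```

The quoted alternative is listed first so that a word starting with a double quote is tried as a quoted phrase before falling back to \S+.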


困倦 2024-12-22 18:32:30

You can try this regex:

/\b(\w+)\b/

which uses \b to find word boundaries. The web site http://rubular.com/ is also helpful for testing regexes.
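It may be worth checking this pattern against the sample string from the question. The short sketch below (the variable names are only for illustration) suggests that \b(\w+)\b treats every run of word characters as its own word, so the quoted phrase is not kept together:

```ruby
text = '   hello "my name" is    "Tom"'

# scan returns one single-element array per match because the pattern
# has one capture group; flatten turns that into a plain list of words.
words = text.scan(/\b(\w+)\b/).flatten

p words
# => ["hello", "my", "name", "is", "Tom"]
```

Here my and name come back as two separate entries, so this pattern only fits inputs without quoted multi-word phrases.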


About the author: 如若梦似彩虹
