Ruby regex to extract words

Posted on 2024-12-15 18:32:30

I'm currently struggling to come up with a regex that splits a string into words, where a word is defined as a sequence of characters surrounded by whitespace or enclosed in double quotes. I'm using String#scan.

For instance, the string:

'   hello "my name" is    "Tom"'

should match the words:

hello
my name
is
Tom

I managed to match the words enclosed in double quotes by using:

/"([^\"]*)"/

but I can't figure out how to also handle the words delimited by whitespace, so that I get 'hello', 'is', and 'Tom' without breaking 'my name'.

Any help with this would be appreciated!

Comments (3)

梦里人 2024-12-22 18:32:30
result = '   hello "my name" is    "Tom"'.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)

should work for you. It returns

=> ["", "hello", "\"my name\"", "is", "\"Tom\""]

Just ignore the empty string at the start; it comes from the leading whitespace.

Explanation

\s+             # One or more whitespace characters (greedy)
(?=             # Positive lookahead: assert that the following can be matched from this position
   (?:          # Non-capturing group:
      [^"]*     #   any number of characters that are not a double quote
      "         #   a literal double quote
      [^"]*     #   any number of characters that are not a double quote
      "         #   a literal double quote
   )*           # Repeated zero or more times, so quotes are consumed in pairs
   [^"]*        # Any number of characters that are not a double quote
   $            # End of a line (end of the string or just before a line break)
)

In other words, the split happens only on whitespace that is followed by an even number of double quotes, which means the whitespace is not inside a quoted section.

You can use reject like this to drop the empty string:

result = '   hello "my name" is    "Tom"'
            .split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/).reject {|s| s.empty?}

which gives

=> ["hello", "\"my name\"", "is", "\"Tom\""]
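
The result above still carries the double quotes around "my name" and "Tom", while the question asks for the bare words. The following is a minimal sketch of one way to remove them, assuming the same split pattern written in free-spacing mode (the /x flag). The names OUTSIDE_QUOTES and split_words are made up for illustration and are not part of the original answer.

```ruby
# Whitespace that sits outside any quoted section, written in free-spacing
# mode so each part can carry its own comment.
OUTSIDE_QUOTES = /
  \s+            # one or more whitespace characters
  (?=            # lookahead, nothing below is consumed
    (?:
      [^"]* "    # non-quote characters, then an opening quote
      [^"]* "    # non-quote characters, then the closing quote
    )*           # any number of complete quoted pairs
    [^"]* $      # then no further quotes before the end of the line
  )
/x

# Hypothetical helper, not from the original answer.
def split_words(text)
  text.split(OUTSIDE_QUOTES)
      .reject(&:empty?)                          # drop the leading empty string
      .map { |token| token.gsub(/\A"|"\z/, '') } # strip one quote at each end
end

p split_words('   hello "my name" is    "Tom"')
# => ["hello", "my name", "is", "Tom"]
```

This sketch assumes the quotes are balanced; with an odd number of quotes the lookahead fails for whitespace before the last quote, so the tokens would come out differently.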
猫弦 2024-12-22 18:32:30
text = '   hello "my name" is    "Tom"'

text.scan(/\s*("([^"]+)"|\w+)\s*/).each {|match| puts match[1] || match[0]}

Produces:

hello
my name
is
Tom

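Explanation:

The pattern matches 0 or more whitespace characters, followed by either some words within double quotes or a single word, followed by 0 or more whitespace characters. The inner group captures the quoted text without its quotes, so match[1] is used when it is present and match[0] otherwise.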
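
A variation on the same idea, offered only as a sketch: it collects the words into an array instead of printing them, and it uses \S+ rather than \w+ for unquoted words, so that tokens containing punctuation such as don't stay in one piece. The method name extract_words and the switch to \S+ are assumptions for illustration, not part of the answer above.

```ruby
# Hypothetical helper: returns quoted phrases (without their quotes) and
# whitespace-delimited words, in the order they appear.
def extract_words(text)
  # First alternative: text between double quotes, captured without the quotes.
  # Second alternative: a run of non-whitespace characters.
  text.scan(/"([^"]*)"|(\S+)/).map { |quoted, bare| quoted || bare }
end

p extract_words('   hello "my name" is    "Tom"')
# => ["hello", "my name", "is", "Tom"]
```

One limitation of this sketch: a quote that appears in the middle of an unquoted token, as in foo"bar baz", is treated as part of that token rather than as the start of a quoted phrase.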

困倦 2024-12-22 18:32:30

You can try this regex:

/\b(\w+)\b/

which uses \b to match word boundaries. The website http://rubular.com/ is also helpful.
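
For comparison, here is a small check of what this pattern returns for the string from the question. Note that it treats every run of word characters as a separate word, so the quoted phrase "my name" comes back as two entries rather than one. The method name boundary_words is made up for this illustration.

```ruby
# Hypothetical helper wrapping the pattern from this answer.
def boundary_words(text)
  # scan returns one array per match when the pattern has a capture group,
  # so flatten turns [["hello"], ["my"], ...] into a flat list of words.
  text.scan(/\b(\w+)\b/).flatten
end

p boundary_words('   hello "my name" is    "Tom"')
# => ["hello", "my", "name", "is", "Tom"]
```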
