
Regex between two strings - including the last string

Posted on 2025-01-11 18:20:11

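I am trying to extract all domains from a text file. They may have a special character in front of them (such as a font tag).

So far: (?<=>).*?(?=com|net)

The text I am searching:

thisdomain.com fake text>thatdomain.net

Currently it finds "thisdomain" and "thatdomain", but of course it cuts off the domain extension. I have been through the regex documentation for about an hour, but I cannot find a way to match between > and .com without cutting off the .com. Any suggestions?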


丿*梦醉红颜 2025-01-18 18:20:11

Use

(?<=>).*?(?:com|net)

See regex proof.

Explanation

--------------------------------------------------------------------------------
  (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
    >                        '>'
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  .*?                      any character except \n (0 or more times
                           (matching the least amount possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    com                      'com'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    net                      'net'
--------------------------------------------------------------------------------
  )                        end of grouping
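
The difference between the asker's lookahead and the non-capturing group above can be checked with a minimal Python sketch. The sample string is the one from the question, with the filler words assumed to read "fake text"; behaviour in other regex engines may differ slightly.

```python
import re

# Sample text from the question (filler words are an assumption).
text = "thisdomain.com fake text>thatdomain.net"

# Original attempt: the lookahead only checks that com/net follows,
# so the extension is not part of the match.
lookahead = re.findall(r"(?<=>).*?(?=com|net)", text)

# Suggested pattern: the non-capturing group consumes com/net,
# so the extension ends up inside the match.
consuming = re.findall(r"(?<=>).*?(?:com|net)", text)

print(lookahead)
print(consuming)
```

In this sketch only the domain preceded by > is found, because the lookbehind requires that character immediately to the left of the match.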



时光沙漏 2025-01-18 18:20:11

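The reason you are not matching .com is that the part (?=com|net) is a lookahead assertion, which is non-consuming: it checks that com or net follows but does not add it to the match.

Instead, you can simply match .com or .net, including the dot. Requiring the dot prevents matches such as >clarinet or >romcom.

(?<=>)\S+\.(?:com|net)

The pattern matches:

(?<= Positive lookbehind, assert that what is directly to the left of the current position is
- > Match literally
) Close the lookbehind
\S+ Match 1 or more non-whitespace characters
\.(?:com|net) Match either .com or .net

See a regex demo.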
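
To illustrate why the literal dot matters, here is a small Python sketch comparing a pattern without the dot against the one above. The names `loose`, `strict`, and `samples` are made up for this illustration; the sample strings reuse the >clarinet and >romcom cases mentioned in the answer.

```python
import re

# Without a literal dot: any text ending in "com" or "net" qualifies.
loose = re.compile(r"(?<=>).*?(?:com|net)")

# With a literal dot: only text ending in ".com" or ".net" qualifies.
strict = re.compile(r"(?<=>)\S+\.(?:com|net)")

samples = ["fake text>thatdomain.net", ">clarinet", ">romcom"]

for s in samples:
    # Print what each pattern finds in the same input.
    print(s, loose.findall(s), strict.findall(s))
```

The stricter pattern still finds thatdomain.net but returns nothing for the two words that merely end in "net" or "com".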

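Without using lookarounds, you can also use a capture group and match the > literally; the domain is then in group 1.

>(\S+\.(?:com|net))

See another regex demo.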
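
A short Python sketch of the capture-group variant follows. The input string is hypothetical: it extends the question's sample with a second domain after a made-up `<b>` tag, purely for illustration.

```python
import re

# Hypothetical input: the question's sample plus one more tagged domain.
text = "thisdomain.com fake text>thatdomain.net <b>other.com"

# The > is matched (consumed) and the domain is captured in group 1.
pattern = re.compile(r">(\S+\.(?:com|net))")

# With exactly one capture group, findall returns the group contents.
domains = pattern.findall(text)
print(domains)

# With search, group(0) is the whole match and group(1) is the domain only.
m = pattern.search(text)
print(m.group(0), m.group(1))
```

The trade-off is that the full match includes the leading >, so the caller has to read group 1 rather than the whole match. In return, the pattern also works in engines that do not support lookbehind.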
