Improving my regex skills

Posted on 2024-07-26 19:25:56

I've been wanting to improve my regex skills for quite some time now, and "Mastering Regular Expressions" was recommended quite a few times, so I bought it and have been reading it over the past day or so.

I have created the following regular expression:

^(?:<b>)?(?:^<i>)?<a href="/site\.php\?id=([0-9]*)">(.*?) \(([ a-z0-9]{2,10})\)</a>(?:^</i>)?(?:</b>)?$

It matches the first two links below but ignores the two enclosed by an <i> tag, and it extracts the id, title and type.

<a href="/site.php?id=6321">site 1 title (type 1)</a>
<b><a href="/site.php?id=10254">site 2 title (type 2)</a></b>

<i><a href="/site.php?id=5479">site 3 title (type 3)</a></i>
<b><i><a href="/site.php?id=325">site 4 title (type 4)</a></i></b>

Although it works, it seems fairly long for something so simple. Could it be improved?
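
One way to check the behaviour described above is to run the pattern against the four sample lines, one line at a time. The following is a minimal sketch in Python; the question does not say which regex engine is in use, so the language and the names `PATTERN`, `SAMPLES` and `extract` are assumptions made for illustration.

```python
import re

# The pattern from the question, unchanged (split over two lines for width).
PATTERN = re.compile(
    r'^(?:<b>)?(?:^<i>)?<a href="/site\.php\?id=([0-9]*)">(.*?) '
    r'\(([ a-z0-9]{2,10})\)</a>(?:^</i>)?(?:</b>)?$'
)

# The four sample lines from the question.
SAMPLES = [
    '<a href="/site.php?id=6321">site 1 title (type 1)</a>',
    '<b><a href="/site.php?id=10254">site 2 title (type 2)</a></b>',
    '<i><a href="/site.php?id=5479">site 3 title (type 3)</a></i>',
    '<b><i><a href="/site.php?id=325">site 4 title (type 4)</a></i></b>',
]


def extract(line):
    """Return (id, title, type) if the line matches, otherwise None."""
    m = PATTERN.match(line)
    return m.groups() if m else None


results = [extract(line) for line in SAMPLES]
for line, result in zip(SAMPLES, results):
    print(result, "<-", line)
```

In this sketch the first two lines yield their three groups and the last two yield None. Note that the `^` inside `(?:^<i>)?` is a start-of-string anchor, not a negation (negation with `^` only applies inside a character class such as `[^i]`). The <i> lines appear to be rejected because `(?:^</i>)?` can never consume a closing tag in the middle of the string, so the final `$` fails.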

Comments (2)

南街九尾狐 2024-08-02 19:25:56

Short of using shorthand character classes (\d for 0-9, etc.), I don't see that the regular expression in question could be shortened much; however...

As a side note, it is worth mentioning that parsing HTML with regular expressions is hazardous at best; when dealing with HTML (and, to a lesser extent, XML), DOM tools are generally better suited.
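
To illustrate the parser-based route, here is a minimal sketch using Python's standard-library `html.parser` module. It is an event-based parser rather than a full DOM, and it stands in for whatever DOM tool is available in your environment. The class name `SiteLinkParser`, the helper `extract_links` and the rule "skip any link with an open <i> ancestor" are assumptions based on the sample data in the question.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class SiteLinkParser(HTMLParser):
    """Collect (id, title, type) from <a> tags that are not inside an <i> tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # stack of currently open tag names
        self.current_id = None   # id of the link being read, if any
        self.text_parts = []     # text chunks seen inside the current link
        self.links = []          # collected (id, title, type) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "a" and "i" not in self.open_tags:
            url = urlparse(dict(attrs).get("href") or "")
            ids = parse_qs(url.query).get("id")
            if url.path == "/site.php" and ids:
                self.current_id = ids[0]
                self.text_parts = []
        self.open_tags.append(tag)

    def handle_data(self, data):
        if self.current_id is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_id is not None:
            text = "".join(self.text_parts).strip()
            # Split "site 1 title (type 1)" at the last " (".
            title, sep, rest = text.rpartition(" (")
            if sep and rest.endswith(")"):
                self.links.append((self.current_id, title, rest[:-1]))
            self.current_id = None
        # Pop back to the most recent matching open tag, tolerating sloppy markup.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break


def extract_links(html):
    """Parse an HTML fragment and return the collected link tuples."""
    parser = SiteLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


SAMPLE_HTML = """\
<a href="/site.php?id=6321">site 1 title (type 1)</a>
<b><a href="/site.php?id=10254">site 2 title (type 2)</a></b>
<i><a href="/site.php?id=5479">site 3 title (type 3)</a></i>
<b><i><a href="/site.php?id=325">site 4 title (type 4)</a></i></b>
"""

print(extract_links(SAMPLE_HTML))
```

This is longer than the regex, but the nesting rule is stated directly (an open <i> ancestor) instead of being encoded in optional groups, and it does not depend on each link sitting on its own line.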

浅语花开 2024-08-02 19:25:56

If you're writing screen scrapers then, as Whilliham rightly mentions, a DOM parser may be a more suitable tool than regex, since HTML is a lot more forgiving than regex.

Not shortened by much, but the regex is a bit more forgiving:

  • Removed the start-of-string and end-of-string checks; did you really need them?
  • Added a negative lookbehind to make sure <a> is not preceded by <i>.
  • Used \d instead of [0-9]; a tad cleaner.
  • You had the type at 2 to 10 characters long; I changed it to 2 or more.
  • Removed the checks for end tags; they serve no contextual meaning for your screen scraper (presumably).

(?<!<i>)<a href="/site.php\?id=(\d*)">(.*?) \(([ a-z\d]{2,})\)
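
As a rough check of the pattern above, the sketch below runs it over the four sample lines from the question in one pass. Python is an assumption here, as are the names `LINK_RE`, `SAMPLE_HTML` and `matches`. The only change to the pattern is that the dot in `site.php` is escaped, as in the question's original.

```python
import re

# The answer's pattern, with the dot in "site\.php" escaped (the only change).
LINK_RE = re.compile(
    r'(?<!<i>)<a href="/site\.php\?id=(\d*)">(.*?) \(([ a-z\d]{2,})\)'
)

# The four sample lines from the question, as one block of text.
SAMPLE_HTML = """\
<a href="/site.php?id=6321">site 1 title (type 1)</a>
<b><a href="/site.php?id=10254">site 2 title (type 2)</a></b>
<i><a href="/site.php?id=5479">site 3 title (type 3)</a></i>
<b><i><a href="/site.php?id=325">site 4 title (type 4)</a></i></b>
"""

# With no anchors in the pattern, findall can scan the whole text at once.
matches = LINK_RE.findall(SAMPLE_HTML)
for site_id, title, site_type in matches:
    print(site_id, title, site_type)
```

On the sample data this finds only the first two links. One caveat: the lookbehind inspects exactly the three characters before `<a`, so an <i> tag separated from the link by whitespace or another tag (for example `<i> <a ...>` or `<i><b><a ...>`) would not be excluded.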
