Regular expression for links in HTML text

Published 2024-07-11

I hope this question is not an RTFM one.
I am trying to write a Python script that extracts links from a standard HTML webpage (the <link href... tags).
I have searched the web for matching regexes and found many different patterns. Is there any agreed, standard regex to match links?

Adam

UPDATE:
I am actually looking for two different answers:

  1. What's the library solution for parsing HTML links? Beautiful Soup seems to be a good solution (thanks, Igal Serban and cletus!)
  2. Can a link be defined using a regex?

Comments (8)

兔小萌 2024-07-18 04:04:29

Regexes with HTML get messy. Just use a DOM parser like Beautiful Soup.

初熏 2024-07-18 04:04:29

As others have suggested, if real-time-like performance isn't necessary, BeautifulSoup is a good solution:

import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen("http://www.google.com").read()
soup = BeautifulSoup(html)
all_links = soup.findAll("a")
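For readers on Python 3, here is a minimal equivalent sketch, assuming the bs4 package (the maintained successor to the old BeautifulSoup module) and a made-up inline document instead of a live URL:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = '<p><a href="http://example.com/">example</a> <a href="/about">about</a></p>'
soup = BeautifulSoup(html, "html.parser")  # stdlib parser backend; no extra C deps
hrefs = [a.get("href") for a in soup.find_all("a")]  # find_all replaces findAll
print(hrefs)  # ['http://example.com/', '/about']
```

The "html.parser" backend keeps the sketch dependency-light; bs4 can also drive lxml or html5lib if they are installed.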

As for the second question, yes, HTML links ought to be well-defined, but the HTML you actually encounter is very unlikely to be standard. The beauty of BeautifulSoup is that it uses browser-like heuristics to try to parse the non-standard, malformed HTML that you are likely to actually come across.

If you are certain to be working on standard XHTML, you can use (much) faster XML parsers like expat.

For the reasons above (a parser must maintain state, which a regex cannot do), regexes will never be a general solution.

梨涡 2024-07-18 04:04:29

No there isn't.

You can consider using Beautiful Soup. You can call it the standard for parsing HTML files.

无所的.畏惧 2024-07-18 04:04:29

Shouldn't a link be a well-defined regex?

No, [X]HTML is not in the general case parseable with regex. Consider examples like:

<link title='hello">world' href="x">link</link>
<!-- <link href="x">not a link</link> -->
<![CDATA[ ><link href="x">not a link</link> ]]>
<script>document.write('<link href="x">not a link</link>')</script>

and that's just a few random valid examples; if you have to cope with real-world tag-soup HTML there are a million malformed possibilities.

If you know and can rely on the exact output format of the target page you can get away with regex. Otherwise it is completely the wrong choice for scraping web pages.
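The point is easy to demonstrate with Python's standard library alone, using a one-line made-up document that echoes the comment example above:

```python
import re
from html.parser import HTMLParser

html = '<link href="real"><!-- <link href="commented-out"> -->'

# A naive pattern also "finds" the link hidden inside the comment:
naive = re.findall(r'<link\s+href="([^"]*)"', html)
print(naive)  # ['real', 'commented-out']

# A stateful parser knows the second tag is comment text, not markup:
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            self.hrefs.append(dict(attrs).get("href"))

collector = LinkCollector()
collector.feed(html)
print(collector.hrefs)  # ['real']
```

The parser never fires handle_starttag for markup inside a comment; the comment text is routed to handle_comment instead, which is exactly the state-tracking a bare regex lacks.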

各自安好 2024-07-18 04:04:29

Shouldn't a link be a well-defined regex? This is a rather theoretical question.

I second PEZ's answer:

I don't think HTML lends itself to "well defined" regular expressions since it's not a regular language.

As far as I know, any HTML tag may contain any number of nested tags. For example:

<a href="http://stackoverflow.com">stackoverflow</a>
<a href="http://stackoverflow.com"><i>stackoverflow</i></a>
<a href="http://stackoverflow.com"><b><i>stackoverflow</i></b></a>
...

Thus, in principle, to match a tag properly you must be able at least to match strings of the form:

BE
BBEE
BBBEEE
...
BBBBBBBBBBEEEEEEEEEE
...

where B means the beginning of a tag and E means the end. That is, you must be able to match strings formed by any number of B's followed by the same number of E's. To do that, your matcher must be able to "count", and regular expressions (i.e. finite state automata) simply cannot do that (in order to count, an automaton needs at least a stack). Referring to PEZ's answer, HTML is described by a context-free grammar, not a regular one.
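A two-line Python experiment (with made-up markup) shows both failure modes of matching nested or repeated tags with a single pattern:

```python
import re

# Lazy matching stops at the first closing tag, splitting a nested pair:
nested = '<b>outer <b>inner</b> tail</b>'
print(re.search(r'<b>(.*?)</b>', nested).group(1))  # 'outer <b>inner'

# Greedy matching overshoots instead when tags merely repeat:
repeated = '<b>one</b> and <b>two</b>'
print(re.search(r'<b>(.*)</b>', repeated).group(1))  # 'one</b> and <b>two'
```

Neither quantifier can pair each B with its own E, which is the counting argument above in miniature.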

简单 2024-07-18 04:04:29

It depends a bit on how the HTML is produced. If it's somewhat controlled you can get away with:

re.findall(r'''<link\s+.*?href=['"](.*?)['"].*?(?:</link|/)>''', html, re.I)
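For illustration, running that pattern over a made-up self-closing stylesheet tag:

```python
import re

# Sample input invented here; the pattern is the one from the answer above.
html = '<link rel="stylesheet" href="style.css" />'
links = re.findall(r'''<link\s+.*?href=['"](.*?)['"].*?(?:</link|/)>''', html, re.I)
print(links)  # ['style.css']
```

The trailing (?:</link|/)> accepts either a closing tag or a self-closing slash, which covers the two forms a controlled generator is likely to emit.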

蘑菇王子 2024-07-18 04:04:29

Answering your two subquestions there.

  1. I've sometimes subclassed SGMLParser (included in the core Python distribution) and must say it's straightforward.
  2. I don't think HTML lends itself to "well defined" regular expressions since it's not a regular language.

柠北森屋 2024-07-18 04:04:29

In response to question #2 (shouldn't a link be a well defined regular expression) the answer is ... no.

An HTML link structure is recursive, much like parens and braces in programming languages. There must be an equal number of start and end constructs, and a "link" expression can be nested within itself.

To properly match a "link" expression, a regex would be required to count the start and end tags. Regular expressions are a class of finite automata, and by definition a finite automaton cannot "count" constructs within a pattern. A grammar is required to describe a recursive data structure such as this. This inability of a regex to "count" is why you see programming languages described with grammars as opposed to regular expressions.

So it is not possible to create a regex that will positively match 100% of all "link" expressions. There are certainly regexes that will match a good number of links with a high degree of accuracy, but they won't ever be perfect.

I wrote a blog article about this problem recently: Regular Expression Limitations.
