Regular expression for links in HTML text
I hope this isn't an RTFM question.
I'm trying to write a Python script that extracts links from a standard HTML web page (the `<a>` tag).
I've searched the web for matching regexes and found many different patterns. Is there any agreed-upon standard regex for matching links?
Adam
Update: I'm actually looking for two different answers:
- What is a library solution for parsing HTML links? Beautiful Soup seems like a good one (thanks, Igal Serban and cletus!)
- Can a link be defined using a regular expression?
8 Answers
Regexes with HTML get messy. Just use a DOM parser like Beautiful Soup.
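As a minimal sketch of that approach (assumes the `beautifulsoup4` package is installed; the page content is inlined here for illustration):

```python
from bs4 import BeautifulSoup

html = ('<html><body>'
        '<a href="http://example.com">Example</a>'
        '<a href="/relative/path">Relative</a>'
        '</body></html>')

# Parse the document and collect the href attribute of every <a> tag.
soup = BeautifulSoup(html, "html.parser")
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]
print(links)  # ['http://example.com', '/relative/path']
```

The same idea works on real pages by feeding the parser the downloaded response body instead of an inline string.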
As others have suggested, if real-time-like performance isn't necessary, BeautifulSoup is a good solution:
As for the second question, yes, HTML links ought to be well-defined, but the HTML you actually encounter is very unlikely to be standard. The beauty of BeautifulSoup is that it uses browser-like heuristics to try to parse the non-standard, malformed HTML that you are likely to actually come across.
If you are certain to be working on standard XHTML, you can use (much) faster XML parsers like expat.
Regexes, for the reasons above (a parser must maintain state, which a regex cannot do), will never be a general solution.
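As a sketch of the expat route (using the stdlib `xml.parsers.expat` module; this only works when the input really is well-formed XHTML, since expat aborts on malformed markup):

```python
import xml.parsers.expat

xhtml = '<html><body><a href="http://example.com">Example</a></body></html>'

links = []

def start_element(name, attrs):
    # Record the href of every <a> element as the parser encounters it.
    if name == "a" and "href" in attrs:
        links.append(attrs["href"])

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse(xhtml, True)
print(links)  # ['http://example.com']
```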
No, there isn't.
You can consider using Beautiful Soup. You could call it the standard for parsing HTML files.
No, [X]HTML is not in the general case parseable with regex. Consider examples like:
and that's just a few random valid examples; if you have to cope with real-world tag-soup HTML there are a million malformed possibilities.
If you know and can rely on the exact output format of the target page you can get away with regex. Otherwise it is completely the wrong choice for scraping web pages.
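To illustrate the kind of valid variation meant here (hypothetical snippets; a naive double-quoted `href` pattern misses all but the last):

```python
import re

# All of these are legal HTML ways to write the same link.
variants = [
    "<a href='http://example.com'>single quotes</a>",
    "<a href=http://example.com>unquoted value</a>",
    '<a  href = "http://example.com">whitespace around =</a>',
    '<a title="x" href="http://example.com">href not first</a>',
]

# A naive pattern that only handles the double-quoted, no-space form.
naive = re.compile(r'href="([^"]*)"')
hits = [bool(naive.search(v)) for v in variants]
print(hits)  # only the last variant matches
```

A real parser handles every one of these forms, which is the point being made.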
I second PEZ's answer:
As far as I know, any HTML tag may contain any number of nested tags. For example:
Thus, in principle, to match a tag properly you must be able at least to match strings of the form:
where B means the beginning of a tag and E means the end. That is, you must be able to match strings formed by any number of B's followed by the same number of E's. To do that, your matcher must be able to "count", and regular expressions (i.e. finite state automata) simply cannot do that (in order to count, an automaton needs at least a stack). Referring to PEZ's answer, HTML is a context-free grammar, not a regular language.
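A concrete demonstration of the counting problem: a non-greedy regex stops at the *first* closing tag, not the one that matches the opening tag it started from.

```python
import re

# Two nested tags: the outer element spans the whole string.
html = "<div>outer <div>inner</div> tail</div>"

match = re.search(r"<div>(.*?)</div>", html)
# The capture is truncated at the first </div>, not the matching one.
print(match.group(1))  # 'outer <div>inner'
```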
It depends a bit on how the HTML is produced. If it's somewhat controlled you can get away with:
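The concrete pattern was not preserved here; a plausible sketch of what such a "controlled input" extraction might look like (it assumes every link uses double quotes with no unusual spacing):

```python
import re

html = '<p><a href="http://example.com/a">A</a> and <a href="/b">B</a></p>'

# Works only because we control the markup: double-quoted hrefs, no odd spacing.
links = re.findall(r'<a\s+href="([^"]+)"', html)
print(links)  # ['http://example.com/a', '/b']
```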
Answering your two subquestions there.
In response to question #2 (shouldn't a link be a well defined regular expression) the answer is ... no.
An HTML link structure is a recursive structure, much like parens and braces in programming languages. There must be an equal number of start and end constructs, and the "link" expression can be nested within itself.
To properly match a "link" expression, a regex would be required to count the start and end tags. Regular expressions are a class of finite automata. By definition, a finite automaton cannot "count" constructs within a pattern; a grammar is required to describe a recursive data structure such as this. The inability of a regex to "count" is why you see programming languages described with grammars as opposed to regular expressions.
So it is not possible to create a regex that will positively match 100% of all "link" expressions. There are certainly regexes that will match a great many "links" with a high degree of accuracy, but they will never be perfect.
I wrote a blog article about this problem recently. Regular Expression Limitations