How can I make lxml's autolink more liberal?

Posted on 2024-11-01 07:39:53

I am using the autolink function of the great lxml library as documented here: http://lxml.de/api/lxml.html.clean-module.html

My problem is that it only detects URLs that start with http://.
I would like to use a broader URL-detection regex like this one:
http://daringfireball.net/2010/07/improved_regex_for_matching_urls

I tried to make that regex work with the lxml autolink function, without success.
I always end up with:

lxml\html\clean.py", line 571, in _link_text
host = match.group('host')
IndexError: no such group

Are there any Python/regex gurus out there who know how to make this work?

Comments (2)

街角迷惘 2024-11-08 07:39:53

There are two things to do in order to adapt the regexp to lxml's autolink. First, wrap the entire URL pattern in a named group (?P<body> .. ). lxml looks this group up by name when it builds each link; as far as I can tell from clean.py, its text becomes the link text, while the whole match becomes the href="" value.

Next, wrap the host part in a (?P<host> .. ) group and pass the avoid_hosts=[] parameter when you call the autolink function. This is needed because the regexp pattern you're using doesn't always find a host (sometimes the host group will be None), since it also matches partial URLs and ambiguous URL-like patterns.

I've modified the regexp to include the above changes and added a small test snippet:

import re
import lxml.html
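import lxml.html.clean  # on recent lxml releases (5.2 and later, I believe) this needs the separate lxml_html_clean package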

url_regexp = re.compile(r"""(?i)\b(?P<body>(?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|(?P<host>[a-z0-9.\-]+[.][a-z]{2,4}/))(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))""")

DOC = """<html><body>
    http://foo.com/blah_blah
    http://foo.com/blah_blah/.
    http://www.extinguishedscholar.com/wpglob/?p=364.
    http://✪df.ws/1234
    rdar://1234
    rdar:/1234
    message://%[email protected]%3e
    What about <mailto:[email protected]?subject=TEST> (including brokets).
    bit.ly/foo
</body></html>"""

tree = lxml.html.fromstring(DOC)
body = tree.find('body')
lxml.html.clean.autolink(body, [url_regexp], avoid_hosts=[])
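print(lxml.html.tostring(tree, encoding='unicode'))  # Python 3 form; the original post used the Python 2 print statement

Output (as posted, from a Python 2 run):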

<html><body>
    <a href="http://foo.com/blah_blah">http://foo.com/blah_blah</a>
    <a href="http://foo.com/blah_blah/">http://foo.com/blah_blah/</a>.
    <a href="http://www.extinguishedscholar.com/wpglob/?p=364">http://www.extinguishedscholar.com/wpglob/?p=364</a>.
    <a href="http://%C3%A2%C2%9C%C2%AAdf.ws/1234">http://✪df.ws/1234</a>
    <a href="rdar://1234">rdar://1234</a>
    <a href="rdar:/1234">rdar:/1234</a>
    <a href="message://%[email protected]%3e">message://%[email protected]%3e</a>
    What about <<a href="mailto:[email protected]?subject=TEST">mailto:[email protected]?subject=TEST</a>>
    (including brackets).
    <a href="bit.ly/foo">bit.ly/foo</a>
</body></html>
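The two named-group requirements this answer describes can be checked with the standard `re` module alone, without lxml. The sketch below is only an illustration: the `plain` and `named` patterns are hypothetical, heavily simplified stand-ins for Gruber's regex, and `group_or_error` is a helper invented for the demo. It assumes, based on the traceback in the question, that lxml calls `match.group('host')` on every match.

```python
import re

# Pattern with only unnamed groups, like the unmodified regex in the question (simplified).
plain = re.compile(r"\b(www\.[a-z0-9.-]+)", re.I)

# Hypothetical, much-simplified pattern carrying the two named groups.
named = re.compile(
    r"\b(?P<body>(?:[a-z]+://|(?P<host>[a-z0-9.-]+\.[a-z]{2,4})/)\S+)", re.I
)


def group_or_error(match, name):
    """Return the named group, or the error text if the pattern lacks it."""
    try:
        return match.group(name)
    except IndexError as exc:
        return "IndexError: %s" % exc


# Asking a match for a group name the pattern never defined raises IndexError.
m_plain = plain.search("see www.example.com")
print(group_or_error(m_plain, "host"))

# With named groups the lookup works, but 'host' can still be None when
# the branch containing it did not take part in the match.
for text in ("bit.ly/foo", "rdar://1234"):
    m = named.search(text)
    print(text, "->", repr(m.group("body")), repr(m.group("host")))
```

The first print reproduces the error message from the question. For rdar://1234 the host group exists in the pattern but did not participate in the match, so it comes back as None; that appears to be the case the empty avoid_hosts list is meant to sidestep.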

秋意浓 2024-11-08 07:39:53

You don't really give enough information to be sure, but I bet you're having escaping issues with the backslashes in Gruber's regex. Try using a raw string, which allows backslashes without escaping, and triple quotes, which let you use quote characters in the string without escaping them either. For example:

re.compile(r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))""")
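To see why the raw-string advice matters, here is a small sketch using only the standard `re` module. The patterns and sample text are made up for illustration and are far simpler than Gruber's regex; the point is only how Python treats `\b` and quote characters in string literals.

```python
import re

text = "visit www.example.com today"

# In a normal string literal, "\b" is a single backspace character (0x08),
# so the regex engine never sees a word-boundary assertion.
non_raw = re.compile("\bwww[.][a-z.]+")

# In a raw string, the backslash and the 'b' both reach the regex engine.
raw = re.compile(r"\bwww[.][a-z.]+")

print(len("\b"), len(r"\b"))
print(non_raw.search(text))
print(raw.search(text).group(0))

# Triple quotes let the pattern contain both ' and " without escaping them.
unquoted_runs = re.compile(r"""[^\s'"]+""")
print(unquoted_runs.findall("say 'hi' \"there\""))
```

The non-raw pattern finds nothing because the text contains no backspace character, while the raw pattern matches the URL-like token as intended.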