How to modify lxml autolink to be more liberal?
I am using the autolink function of the great lxml library as documented here: http://lxml.de/api/lxml.html.clean-module.html
My problem is that it only detects urls that start with http://.
I would like to use a broader url detection regex like this one:
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
I tried to make that regex work with the lxml autolink function without success.
I always end up with a:
lxml\html\clean.py", line 571, in _link_text
host = match.group('host')
IndexError: no such group
Any python/regex gurus out there who know how to make this work?
Comments (2)
There are two things to do in order to adapt the regexp to lxml's autolink. First, wrap the entire URL pattern in a (?P<body> .. ) group; this lets lxml know what goes inside the href="" attribute. Next, wrap the host part in a (?P<host> .. ) group and pass the avoid_hosts=[] parameter when you call the autolink function. The reason for this is that the regexp pattern you're using doesn't always find a host (sometimes the host group will be None), since it matches partial URLs and ambiguous URL-like patterns.
I've modified the regexp to include the above changes and put together a small test case.
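A minimal sketch of the two changes described above. The pattern here is a short, simplified stand-in chosen for illustration, not Gruber's full regex, and the names LIBERAL_URL and autolink_liberal are hypothetical. It assumes lxml.html.clean is importable; depending on the lxml version, that module may require the separate lxml_html_clean package.

```python
import re

# Simplified stand-in for a liberal URL pattern (NOT Gruber's full regex).
# The two named groups are the ones lxml's autolink looks up on each match.
LIBERAL_URL = re.compile(
    r"""(?P<body>                                         # whole URL
        (?:https?://|www\.)?                              # scheme or www. is optional
        (?P<host>[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,})  # host name
        (?:/[^\s<>"']*)?                                  # optional path
    )""",
    re.IGNORECASE | re.VERBOSE,
)


def autolink_liberal(html):
    """Run lxml's autolink_html with the custom pattern (hypothetical helper)."""
    # Imported lazily so the regex itself can be tested without lxml installed.
    from lxml.html.clean import autolink_html

    # avoid_hosts=[] skips the host filtering step, which would otherwise
    # fail for a pattern whose host group can be None.
    return autolink_html(html, link_regexes=[LIBERAL_URL], avoid_hosts=[])


# The groups can be checked with the re module alone.
m = LIBERAL_URL.search("See www.example.com/docs for details")
print(m.group("body"), m.group("host"))
```

Calling autolink_liberal("<p>See www.example.com/docs</p>") should then wrap the matched text in an anchor. A pattern this liberal will also match things like file names, so it would need tightening for real use.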
You don't really give enough information to be sure, but I bet that you're having escaping issues with the backslashes in Gruber's regex. Try using a raw string, which allows backslashes without escaping, and triple-quotes, which allow you to use quotes in the string without having to escape those either. E.g.
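As a minimal sketch of that suggestion (the pattern below is a short illustrative one, not Gruber's full regex, and the name QUOTED is hypothetical):

```python
import re

# In a normal string "\b" is a backspace character; in a raw string it stays
# as backslash + b, which the regex engine reads as a word boundary.
assert "\b" == "\x08"
assert r"\b" == "\\b"

# Raw + triple-quoted: backslashes and both quote characters can appear
# in the pattern without any extra escaping.
QUOTED = re.compile(r"""\b(?:https?://|www\.)[^\s"'<>]+""")

match = QUOTED.search('He wrote "www.example.com/a" in the post')
print(match.group(0))
```

Note that this only addresses escaping; lxml's autolink still expects the named groups body and host to exist in whatever pattern is passed to it.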