Why does the BeautifulSoup parser append a semicolon to a bare ampersand, breaking the URL?
I am trying to parse a site in Python that contains links to other sites, but as plain text rather than inside "a" tags. Using BeautifulSoup I get the wrong result. Consider this code:
import BeautifulSoup
html = """<html>
<head>
<title>Test html</title>
</head>
<body>
<div>
example.com/a.php?b=2&c=15
</div>
</body>
</html>"""
parsed = BeautifulSoup.BeautifulSoup(html)
print parsed
When I run the above code I get the following output:
<html>
<head>
<title>Test html</title>
</head>
<body>
<div>
example.com/a.php?b=2&c;=15
</div>
</body>
</html>
Notice the link in the "div" and the part b=2&c;=15: it's different from the original HTML. Why is BeautifulSoup messing with the link like this? Is it trying to automagically create HTML entities? How can I prevent this?
1 Answer
Apparently BS has an under-documented issue parsing ampersands inside URLs; I just searched their discussion forum for 'semicolon'. According to that discussion from 2009, a naked & is strictly not valid and must be replaced by &amp;, although browsers accept the bare form, so this seems way pedantic. I agree this parsing behavior is bogus, and you should contact their list to ask them to at least document it better as a known issue, and to fix it in the future.
Workaround: Anyway, your workaround will most likely be re.sub(...) to capture and expand & -> &amp;, but only inside URLs. Possibly you need a reverse function to compress them again in the output. You'll need a fancier regex to catch only the ampersands inside URLs, but anyway: there may be other, more BS-thonic approaches.
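For illustration, here is a minimal sketch of that approach against your original snippet. The helper name escape_bare_ampersands and the exact entity-skipping regex are my own assumptions, not anything BS provides; the idea is just to turn bare ampersands into &amp; before the parser sees them, so it has nothing left to "complete" with a semicolon.

import re
import BeautifulSoup

html = """<html>
<body>
<div>
example.com/a.php?b=2&c=15
</div>
</body>
</html>"""

# Escape bare ampersands -- ones not already starting an entity such as
# &amp; or &#38; -- before the markup reaches BeautifulSoup.
def escape_bare_ampersands(markup):
    return re.sub(r'&(?![A-Za-z]+;|#\d+;|#x[0-9A-Fa-f]+;)', '&amp;', markup)

parsed = BeautifulSoup.BeautifulSoup(escape_bare_ampersands(html))
print parsed

# To get the literal "&" back when extracting the text, compress it again:
text = ''.join(parsed.find('div').findAll(text=True))
print text.replace('&amp;', '&').strip()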
You may want to help test the 4.0 beta.
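If you do try the beta (the bs4 package), a quick check is to run the same markup through it and see whether the URL still gets rewritten; this sketch assumes bs4 is installed and makes no claim about what it will print.

from bs4 import BeautifulSoup

html = '<div>example.com/a.php?b=2&c=15</div>'
# Whether "b=2&c=15" survives unchanged here is exactly what testing
# the 4.0 beta would tell you.
print BeautifulSoup(html)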