How do I download and read a URL with universal newlines?

Posted on 2024-12-17 06:25:45

I am using urllib.urlopen with Python 2.7, but I need to process the downloaded HTML document together with the newlines it contains (inside one of its elements).

The urllib docs indicate that urlopen does not use universal newlines. How can I do this?


网名女生简单气质 2024-12-24 06:25:45

Unless the HTML file is already on your disk, urlopen() should handle every newline format (\n, \r\n and \r) in the HTML file you want to parse, that is, convert them all to \n. The urllib docs mention the missing universal-newlines support only for local files:

"If the URL does not have a scheme identifier, or if it has file: as its scheme identifier, this opens a local file (without universal newlines)"

For example:

>>> from urllib import urlopen
>>> urlopen("http://****.com/win_new_lines.htm").read()
'line 1\nline 2\n\n\nline 3'
>>> urlopen("http://****.com/unix_new_lines.htm").read()
'line 1\nline 2\n\n\nline 3'
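If the downloaded data nevertheless still contains \r\n or \r (worth checking with repr(), since the result depends on what the server actually sends), the newlines can be normalized explicitly after reading. The following is only a minimal sketch: normalize_newlines is a hypothetical helper name, not part of urllib, and the sample string is made up for illustration.

```python
def normalize_newlines(data):
    # Hypothetical helper (not part of urllib): maps CRLF and lone CR to LF.
    # Works on byte strings (Python 2 str / Python 3 bytes) and on text.
    if isinstance(data, bytes):
        crlf, cr, lf = b"\r\n", b"\r", b"\n"
    else:
        crlf, cr, lf = u"\r\n", u"\r", u"\n"
    # Replace CRLF first so that it does not become two newlines.
    return data.replace(crlf, lf).replace(cr, lf)


# Made-up sample mixing all three newline conventions.
sample = b"line 1\r\nline 2\rline 3\nline 4"
print(repr(normalize_newlines(sample)))

# With a real download under Python 2.7 this would be used as, for example:
#   from urllib import urlopen
#   html = normalize_newlines(urlopen(url).read())
```

If line-by-line processing is enough, the built-in splitlines() method of strings is an alternative, since it also treats \n, \r\n and \r as line boundaries.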
