Limiting downloaded text content in pycurl
I want to download site content using curl in Python (pycurl), but I don't want the whole text of those sites, just certain parts. I want to reduce the time spent downloading the full text. Thank you.
You should set the relevant headers in your HTTP request; see this question on how to do it with pycurl.

NOTE: This only works if you:
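As a rough sketch of the header-based approach, the following asks the server for only the first few kilobytes with an HTTP `Range` header. This assumes the server honors range requests (it answers with 206 Partial Content); the URL and byte range are placeholders, not from the original question.

```python
# Sketch: request only the first bytes of a page via an HTTP Range header.
# Assumption: the server supports range requests (responds 206 Partial Content).
# The URL and byte budget below are placeholders.
from io import BytesIO


def build_range_header(start, end):
    """Build a Range header requesting bytes start..end (inclusive)."""
    return "Range: bytes=%d-%d" % (start, end)


def fetch_partial(url, max_bytes=4096):
    import pycurl  # imported here so build_range_header works without pycurl

    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.HTTPHEADER, [build_range_header(0, max_bytes - 1)])
    c.setopt(pycurl.WRITEDATA, buf)
    c.perform()
    status = c.getinfo(pycurl.RESPONSE_CODE)  # 206 if the range was honored
    c.close()
    return status, buf.getvalue()


if __name__ == "__main__":
    status, body = fetch_partial("https://example.com/", 1024)
    print(status, len(body))
```

If the server ignores the `Range` header it will return 200 with the full body, so checking for status 206 is the way to tell whether any time was actually saved.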
The delay in loading a page is generally not in the actual download of the HTML -- that is often quite quick, since HTML is nothing more than Unicode text. Unless there is a huge amount of actual text and markup on a page, you are not going to save much. Further, in order to get any of the actual content of the page, you need to download the entire `<head>` anyway...

Personally, I would approach this asynchronously. Twisted is one of the more common suggestions for this type of approach.