Can I use lxml to download only part of a web page?
I'm not sure whether this is possible, and I don't find the lxml documentation very clear.
For example, can I use something like:
import lxml.html as lx
x = lx.parse('http://web.info/page.html')
y = x.xpath('//something/interesting')[2]
or similar, so that I don't download the whole page?
If not with lxml, is there some other Python module that can do this?
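You could try incremental parsing: feed the page to the parser in chunks as it downloads, and stop reading once the part you need has been parsed.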
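As a rough illustration of the incremental idea, here is a minimal sketch. It deliberately swaps in the standard library's html.parser instead of lxml so that it runs without extra installs (lxml's parsers offer a similar feed()/close() interface). The class FirstTitleFinder, the function find_title, the choice of the title element as the "interesting" part, and the UTF-8 encoding are all assumptions made for the example, not part of the original answer.

```python
import codecs
import io
from html.parser import HTMLParser


class FirstTitleFinder(HTMLParser):
    """Collects the text of the first <title> element, then sets `done`."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_data(self, data):
        # May be called more than once if the text spans several chunks.
        if self.in_title:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True


def find_title(stream, chunk_size=1024):
    """Feed `stream` to the parser chunk by chunk and stop reading
    as soon as the first <title> element has been closed.

    Returns (title_text, bytes_read). Assumes the page is UTF-8.
    """
    finder = FirstTitleFinder()
    # An incremental decoder copes with multi-byte characters that
    # happen to be split across two chunks.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    bytes_read = 0
    while not finder.done:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of document, title never completed
        bytes_read += len(chunk)
        finder.feed(decoder.decode(chunk))
    return "".join(finder.text), bytes_read


# Demonstration on an in-memory page; a real page would come from
# something like urllib.request.urlopen(url), which also supports read(n).
page = (b"<html><head><title>Example page</title></head><body>"
        + b"<p>filler</p>" * 10000 + b"</body></html>")
title, used = find_title(io.BytesIO(page), chunk_size=64)
print(title, used, len(page))
```

Only the chunks read before the element closes are consumed; whether the rest of the transfer is actually avoided depends on closing the connection promptly and on how much the server has already sent.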
No: lxml has to parse the whole page before it can be guaranteed to find an individual bit of it, and to parse the whole page, it obviously has to download the whole page. (But see also unutbu's answer for a potential partial downloading/parsing approach.) And although I believe one can make HTTP requests for part of a file (via the Range header), that's not guaranteed to be supported on the server side. It's a shame that HTTP doesn't include a method for sending an XPath query to the server along with the page request and having the results of running that query on the page sent back.
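For what it's worth, a Range request might look roughly like the sketch below, using only urllib from the standard library. The helper names build_range_request and fetch_range are made up for this example, and the URL is the placeholder from the question. A server that honours the header replies with status 206 (Partial Content); one that ignores it usually replies 200 and sends the whole page.

```python
import urllib.request


def build_range_request(url, first_byte, last_byte):
    """Build a GET request asking only for bytes first_byte..last_byte (inclusive)."""
    req = urllib.request.Request(url)
    req.add_header("Range", f"bytes={first_byte}-{last_byte}")
    return req


def fetch_range(url, first_byte, last_byte):
    """Return (body, honoured).

    `honoured` is True only if the server answered 206 Partial Content.
    If the server ignored the header, the bytes returned are simply the
    start of the full page, not necessarily the requested range.
    """
    req = build_range_request(url, first_byte, last_byte)
    with urllib.request.urlopen(req) as resp:
        honoured = resp.status == 206
        # Never read more than was asked for, even on a full 200 response.
        body = resp.read(last_byte - first_byte + 1)
    return body, honoured


# Building the request needs no network access:
req = build_range_request("http://web.info/page.html", 0, 1023)
print(req.get_header("Range"))
```

Even when the range is honoured, the slice is an arbitrary run of bytes that may start or end in the middle of a tag, so the parser still has to cope with an incomplete document.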