Python: urlopen not downloading the entire site
Greetings,
I have done:
import urllib
site = urllib.urlopen('http://www.weather.com/weather/today/Temple+TX+76504')
site_data = site.read()
site.close()
but the result doesn't match what I see with View Source when the page is loaded in Firefox.
I suspected the user agent and did this:
# Python 2: override the User-Agent string that urllib sends
class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0 (X11; U; Linux i686; zh-CN; rv:1.9.2.8) Gecko/20100722 Ubuntu/10.04 (lucid) Firefox/3.6.8"

urllib._urlopener = AppURLopener()
and downloaded it, but it still doesn't download the whole website.
Can someone please help me do user agent switching, if that is the likely culprit?
Thanks,
Narnie
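For readers on Python 3, where this functionality lives in urllib.request, the user-agent override attempted above might be sketched roughly as follows. The helper names build_request and fetch are made up for illustration, and the timeout value is an arbitrary assumption; the user-agent string is the one from the question.

```python
import urllib.request

# Copied from the question; any browser-like string could be used instead.
USER_AGENT = ("Mozilla/5.0 (X11; U; Linux i686; zh-CN; rv:1.9.2.8) "
              "Gecko/20100722 Ubuntu/10.04 (lucid) Firefox/3.6.8")


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries a custom User-Agent header.

    Hypothetical helper, not part of urllib itself.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT, timeout=10):
    """Download the raw bytes the server returns for `url`.

    Hypothetical helper; the 10 second timeout is an arbitrary choice.
    """
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

It would be called as site_data = fetch('http://www.weather.com/weather/today/Temple+TX+76504'). This only changes the header sent with the request; it returns the HTML exactly as the server sends it, so content that a browser adds later is still not included.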
Comments (2)
It's more likely that there is an iframe in the code or that JavaScript is modifying the DOM. If there's an iframe, you'll have to parse the page to get the URL for the iframe, or just do it manually if it's a one-off. If it's JavaScript, I hear that selenium-rc is good, but I have no first-hand experience with it.

A downloaded page displayed locally may look different for several reasons, such as relative links (which can be fixed by adding e.g.
<base href="http://www.weather.com/today/">
to the page's head element) or non-functional Ajax requests (see Ways to circumvent the same-origin policy).
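
As a rough illustration of the two suggestions in these comments, the Python 3 sketch below pulls iframe URLs out of a downloaded page and inserts a base tag into its head. The names IframeFinder, iframe_urls, and add_base_tag are invented for this example, and it assumes the page has already been decoded to a string. It does nothing about content that JavaScript adds after the page loads.

```python
import html
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeFinder(HTMLParser):
    """Collect the src attribute of every <iframe> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(page_html, page_url):
    """Return absolute URLs for the iframes found in page_html."""
    finder = IframeFinder()
    finder.feed(page_html)
    # Resolve relative src values against the URL the page came from.
    return [urljoin(page_url, src) for src in finder.sources]


# Matches <head> or <head ...attributes...>, but not <header>.
HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def add_base_tag(page_html, base_url):
    """Insert a <base href=...> tag right after the opening <head> tag.

    The page is returned unchanged if no <head> tag is found.
    """
    tag = '<base href="%s">' % html.escape(base_url, quote=True)
    return HEAD_OPEN.sub(lambda m: m.group(0) + tag, page_html, count=1)
```

Each URL returned by iframe_urls would then need its own download, and the output of add_base_tag can be saved to disk so that relative links resolve against the original site when the file is opened locally.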