urllib2 returning a different page than the browser?
I'm trying to scrape a page (my router's admin page), but the device seems to be serving a different page to urllib2 than to my browser. Has anyone run into this before? How can I get around it?
This is the code I'm using:
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> page = urllib2.urlopen("http://192.168.1.254/index.cgi?active_page=9133&active_page_str=page_bt_home&req_mode=0&mimic_button_field=btn_tab_goto:+9133..&request_id=36590071&button_value=9133")
>>> soup = BeautifulSoup(page)
>>> soup.prettify()
(html output is removed by markdown)
With Firebug, watch which headers and cookies are sent to the server, then emulate the same request with urllib2.Request and cookielib.
EDIT: You can also use mechanize.
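A minimal sketch of that approach, assuming the header values below stand in for whatever Firebug actually shows (the User-Agent string and Referer are placeholders, not the real captured values):

```python
# Sketch: emulate the browser's request with urllib2 + cookielib.
# (On Python 3 the same modules live in urllib.request and http.cookiejar.)
try:
    import urllib2          # Python 2, as in the question
    import cookielib
except ImportError:
    import urllib.request as urllib2
    import http.cookiejar as cookielib

# The CookieJar stores cookies from responses and sends them back automatically.
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies))

# Copy the headers Firebug shows for the browser's request.
req = urllib2.Request("http://192.168.1.254/index.cgi", headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0",
    "Referer": "http://192.168.1.254/",
})
# page = opener.open(req)  # sends the request with the copied headers + cookies
```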
Simpler than Wireshark may be to use Firebug to see the form of the request being made, and then emulate the same in your code.
Use Wireshark to see what your browser's request looks like, and add the missing parts so that your request looks the same.
To tweak urllib2 headers, try this.
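The link behind "this" did not survive here; as a hedged sketch, one common way to tweak urllib2's default headers is through the opener's `addheaders` list (the User-Agent string below is just an example):

```python
try:
    import urllib2                     # Python 2, as in the question
except ImportError:
    import urllib.request as urllib2   # Python 3 equivalent

opener = urllib2.build_opener()
# Replace the default "Python-urllib/x.y" User-Agent with a browser-like one,
# applied to every request this opener makes.
opener.addheaders = [("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)")]
# html = opener.open("http://192.168.1.254/").read()
```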
Probably this isn't working because you haven't supplied credentials for the admin page.
Use mechanize to load the login page and fill out the username/password.
Then you should have a cookie set that allows you to continue to the admin page.
It is much harder using just urllib2; you will need to manage the cookies yourself if you choose to stick with that route.
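For the urllib2-only route, a sketch of the manual cookie handling this answer mentions. The login URL and form-field names here are guesses for illustration; the real ones must be taken from the router's login page HTML:

```python
try:
    import urllib2
    import cookielib
    from urllib import urlencode        # Python 2, as in the question
except ImportError:
    import urllib.request as urllib2
    import http.cookiejar as cookielib
    from urllib.parse import urlencode  # Python 3 equivalents

# The CookieJar keeps the session cookie across requests made by this opener.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# Hypothetical form fields -- inspect the actual login form for the real names.
credentials = urlencode({"username": "admin", "password": "secret"}).encode("ascii")
# opener.open("http://192.168.1.254/login.cgi", credentials)  # sets session cookie
# page = opener.open("http://192.168.1.254/index.cgi?...")    # now authenticated
```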
In my case it was one of the following:
1) The website could tell that the access was not from a browser, so I had to fake a browser in Python like this:
2) The content of the page was filled dynamically by JavaScript. In that case, read the following post: https://stackoverflow.com/a/11460633/2160507
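The code for case (1) did not survive here; a minimal reconstruction of the browser-faking idea, assuming a spoofed User-Agent header is what the site checks (the UA string is an arbitrary example):

```python
try:
    import urllib2                     # Python 2, as used in the question
except ImportError:
    import urllib.request as urllib2   # Python 3 equivalent

# Present a browser-like User-Agent instead of urllib2's default.
req = urllib2.Request("http://192.168.1.254/", headers={
    "User-Agent": ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36"),
})
# html = urllib2.urlopen(req).read()  # served as if a browser had asked
```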