urllib2 returns a different page than the browser?

Published 2024-09-09 05:33:34 · 534 characters · 8 views · 0 comments

I'm trying to scrape a page (my router's admin page), but the device seems to be serving a different page to urllib2 than to my browser. Has anyone run into this before? How can I get around it?

This is the code I'm using:

>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> page = urllib2.urlopen("http://192.168.1.254/index.cgi?active_page=9133&active_page_str=page_bt_home&req_mode=0&mimic_button_field=btn_tab_goto:+9133..&request_id=36590071&button_value=9133")
>>> soup = BeautifulSoup(page)
>>> soup.prettify()

(HTML output removed by markdown)

Comments (5)

霓裳挽歌倾城醉 2024-09-16 05:33:34

With Firebug, watch which headers and cookies your browser sends to the server. Then emulate the same request with urllib2.Request and cookielib.

EDIT: You can also use mechanize.
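A minimal sketch of this approach, assuming the header values below stand in for whatever Firebug actually shows; it is written for Python 2 (as in the question), with the equivalent Python 3 names as a fallback:

```python
try:
    import urllib2          # Python 2, as used in the question
    import cookielib
except ImportError:
    import urllib.request as urllib2    # same classes under Python 3 names
    import http.cookiejar as cookielib

# The jar remembers any Set-Cookie the router sends back,
# and replays it on later requests made through this opener.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# Copy the headers Firebug shows for the browser's request
# (the values here are placeholders, not the real ones).
opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0'),
    ('Referer', 'http://192.168.1.254/'),
]

# opener.open(url) would now send those headers and replay cookies.
```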

风月客 2024-09-16 05:33:34

It may be simpler than Wireshark to use Firebug to see the form of the request being made, and then to emulate the same in your code.

请恋爱 2024-09-16 05:33:34

Use Wireshark to see what your browser's request looks like, and add the missing parts so that your request looks the same.

To tweak urllib2 headers, try this.
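One way to set per-request headers is to pass a dict when building the Request; a sketch, again written for Python 2 with a Python 3 fallback, where the header values are illustrative placeholders:

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # Python 3 equivalent

url = "http://192.168.1.254/index.cgi"

# Headers passed here replace urllib2's defaults for this request only.
req = urllib2.Request(url, headers={
    'User-Agent': 'Mozilla/5.0',        # pretend to be a browser
    'Accept': 'text/html',
})

# urllib2.urlopen(req) would send these headers instead of the defaults.
```

Note that urllib2 normalizes header names (capitalizing only the first letter), so the stored key is `User-agent`.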

凉宸 2024-09-16 05:33:34

Probably this isn't working because you haven't supplied credentials for the admin page.

Use mechanize to load the login page and fill out the username/password.

Then you should have a cookie set to allow you to continue to the admin page.

It is much harder using just urllib2. You will need to manage the cookies yourself if you choose to stick to that route.
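A sketch of the urllib2-only route this answer warns about; the login path and form field names (`user`, `pass`) are assumptions about the router, not known values, and would have to be read from the real login page:

```python
try:
    import urllib2                      # Python 2
    import cookielib
    from urllib import urlencode
except ImportError:
    import urllib.request as urllib2    # Python 3 equivalents
    import http.cookiejar as cookielib
    from urllib.parse import urlencode

# One jar shared by every request made through this opener, so the
# session cookie from the login survives to the admin-page request.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# Hypothetical form fields; the real names come from the login page's HTML.
login_data = urlencode({'user': 'admin', 'pass': 'secret'})

# opener.open('http://192.168.1.254/login.cgi', login_data) would POST
# the credentials; the session cookie then lands in `jar` and is replayed
# on the follow-up request to the admin page.
```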

双马尾 2024-09-16 05:33:34

In my case it was one of the following:

1) The website could tell that the access was not from a browser, so I had to fake a browser in Python like this:

import urllib2
from BeautifulSoup import BeautifulSoup

# Build an opener to fake a browser... Google here I come!
opener = urllib2.build_opener()
# Send a browser-like User-Agent to fake the browser
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
# Read the page
soup = BeautifulSoup(opener.open(url).read())

2) The content of the page was filled in dynamically by JavaScript. In that case, read the following post: https://stackoverflow.com/a/11460633/2160507
