urllib2 returning a different page than the browser?
I'm trying to scrape a page (my router's admin page), but the device seems to be serving a different page to urllib2 than to my browser. Has anyone run into this before? How can I get around it?
This is the code I'm using:
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> page = urllib2.urlopen("http://192.168.1.254/index.cgi?active_page=9133&active_page_str=page_bt_home&req_mode=0&mimic_button_field=btn_tab_goto:+9133..&request_id=36590071&button_value=9133")
>>> soup = BeautifulSoup(page)
>>> soup.prettify()
(html output is removed by markdown)
With Firebug, watch which headers and cookies are sent to the server, then emulate the same request with urllib2.Request and cookielib.
EDIT: You can also use mechanize.
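A minimal sketch of that approach, assuming the header values below stand in for whatever Firebug actually shows (the User-Agent string and Referer are placeholders, not the real captured values):

```python
# Sketch: emulate the browser's request with urllib2 + cookielib.
# (On Python 3 the same modules live in urllib.request and http.cookiejar.)
try:
    import urllib2          # Python 2, as in the question
    import cookielib
except ImportError:
    import urllib.request as urllib2
    import http.cookiejar as cookielib

# The CookieJar stores cookies from responses and sends them back automatically.
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies))

# Copy the headers Firebug shows for the browser's request.
req = urllib2.Request("http://192.168.1.254/index.cgi", headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0",
    "Referer": "http://192.168.1.254/",
})
# page = opener.open(req)  # sends the request with the copied headers + cookies
```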
Simpler than Wireshark may be to use Firebug to see the form of the request being made, and then emulate the same in your code.
Use Wireshark to see what your browser's request looks like, and add the missing parts so that your request looks the same.
To tweak urllib2 headers, try this.
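The link behind "this" did not survive here; as a hedged sketch, one common way to tweak urllib2's default headers is through the opener's `addheaders` list (the User-Agent string below is just an example):

```python
try:
    import urllib2                     # Python 2, as in the question
except ImportError:
    import urllib.request as urllib2   # Python 3 equivalent

opener = urllib2.build_opener()
# Replace the default "Python-urllib/x.y" User-Agent with a browser-like one,
# applied to every request this opener makes.
opener.addheaders = [("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)")]
# html = opener.open("http://192.168.1.254/").read()
```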
Probably this isn't working because you haven't supplied credentials for the admin page.
Use mechanize to load the login page and fill out the username/password.
Then you should have a cookie set that allows you to continue to the admin page.
It is much harder using just urllib2; you will need to manage the cookies yourself if you choose to stick with that route.
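For the urllib2-only route, a sketch of the manual cookie handling this answer mentions. The login URL and form-field names here are guesses for illustration; the real ones must be taken from the router's login page HTML:

```python
try:
    import urllib2
    import cookielib
    from urllib import urlencode        # Python 2, as in the question
except ImportError:
    import urllib.request as urllib2
    import http.cookiejar as cookielib
    from urllib.parse import urlencode  # Python 3 equivalents

# The CookieJar keeps the session cookie across requests made by this opener.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# Hypothetical form fields -- inspect the actual login form for the real names.
credentials = urlencode({"username": "admin", "password": "secret"}).encode("ascii")
# opener.open("http://192.168.1.254/login.cgi", credentials)  # sets session cookie
# page = opener.open("http://192.168.1.254/index.cgi?...")    # now authenticated
```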
In my case it was one of the following:
1) The website could tell that the access was not from a browser, so I had to fake a browser in Python like this:
2) The content of the page was filled dynamically by JavaScript. In that case, read the following post: https://stackoverflow.com/a/11460633/2160507
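The code for case (1) did not survive here; a minimal reconstruction of the browser-faking idea, assuming a spoofed User-Agent header is what the site checks (the UA string is an arbitrary example):

```python
try:
    import urllib2                     # Python 2, as used in the question
except ImportError:
    import urllib.request as urllib2   # Python 3 equivalent

# Present a browser-like User-Agent instead of urllib2's default.
req = urllib2.Request("http://192.168.1.254/", headers={
    "User-Agent": ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36"),
})
# html = urllib2.urlopen(req).read()  # served as if a browser had asked
```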