Mechanize response returns no content
I'm using Mechanize in Python to do some web scraping. Most of the website works, but one particular page comes back with an empty response: no content at all.
My settings are:
self._browser = mechanize.Browser()
self._browser.set_handle_refresh(True)
self._browser.set_debug_responses(True)
self._browser.set_debug_redirects(True)
self._browser.set_debug_http(True)
and the code to execute is:
response = self._browser.open(url)
This is the debug output:
add_cookie_header
Checking xyz.com for cookies to return
- checking cookie path=/
- checking cookie <Cookie ASP.NET_SessionId=j3pg0wnavh3yjseyj1v3mr45 for xyz.com/>
it's a match
send: 'GET /page.aspx?leagueID=39 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: xyz.com\r\nCookie: ASP.NET_SessionId=aapg9wnavh3yqyrtg1v3ar45\r\nConnection: close\r\nUser-Agent: Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Tue, 07 Feb 2012 19:04:37 GMT
header: Pragma: no-cache
header: Expires: -1
header: Connection: close
header: Cache-Control: no-cache
header: Content-Length: 0
extract_cookies: Date: Tue, 07 Feb 2012 19:04:37 GMT
Pragma: no-cache
Expires: -1
Connection: close
Cache-Control: no-cache
Content-Length: 0
I've tried with and without redirect handling, to no avail. Any ideas?
I might add that the page works fine in a browser.
Comments (1)
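The usual procedure for finding the problem is:
1. Capture the headers sent and received when the page is loaded in a web browser.
2. Capture the headers sent and received when the page is requested by your script.
3. Compare the two and change the script until its request matches the browser's.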
For the first step, there are many tools available. For example, in Firefox, HttpFox and Live HTTP Headers can be quite useful.
For the second step, programmatically logging the headers being sent/received should be enough.
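As a minimal sketch of that kind of logging, the snippet below uses only the Python standard library rather than mechanize. It sends one GET to a throwaway local server that records the request headers, and returns them together with the reply headers. `HeaderRecorder` and `capture_script_headers` are hypothetical names, and the local server is an assumption made only to keep the example self-contained; against the real site, the `set_debug_http(True)` setting from the question already prints the same kind of information.

```python
import http.server
import threading
import urllib.request


class HeaderRecorder(http.server.BaseHTTPRequestHandler):
    """Hypothetical local handler: remembers the headers of the last GET."""

    received = {}  # filled in by do_GET

    def do_GET(self):
        # Store what the client actually sent, keyed by lower-cased name.
        HeaderRecorder.received = {k.lower(): v for k, v in self.headers.items()}
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence the default per-request line on stderr


def capture_script_headers(extra_headers):
    """Send one GET to a local server; return (request_headers, reply_headers)."""
    HeaderRecorder.received = {}
    server = http.server.HTTPServer(("127.0.0.1", 0), HeaderRecorder)  # port 0: any free port
    server.timeout = 5  # do not wait forever if the request never arrives
    thread = threading.Thread(target=server.handle_request)  # serve a single request
    thread.start()
    try:
        url = "http://127.0.0.1:%d/page.aspx?leagueID=39" % server.server_address[1]
        request = urllib.request.Request(url, headers=extra_headers)
        # An empty ProxyHandler keeps the request away from any configured proxy.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            response.read()
            reply_headers = {k.lower(): v for k, v in response.headers.items()}
    finally:
        thread.join()
        server.server_close()
    return HeaderRecorder.received, reply_headers


if __name__ == "__main__":
    sent, reply = capture_script_headers({"User-Agent": "Mozilla/5.0 (test)"})
    for name, value in sorted(sent.items()):
        print("sent  %s: %s" % (name, value))
    for name, value in sorted(reply.items()):
        print("reply %s: %s" % (name, value))
```

Recording the headers on the receiving side shows everything the client library adds by itself (such as `Host` or `Connection`), not only the headers set explicitly in the script.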
For both steps, you can also capture the traffic on your network interface with a tool such as Wireshark.
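Once both sets of headers have been captured, comparing them can be scripted. The sketch below is one possible way to do it: `diff_headers` is a hypothetical helper, the script headers are copied from the `send:` line of the debug output in the question (the Cookie header is left out for brevity), and the browser headers are made-up placeholders, not a real capture.

```python
def diff_headers(browser_headers, script_headers):
    """Return the headers that differ as {name: (browser_value, script_value)}.

    Header names are compared case-insensitively; a header missing on one
    side is reported with None for that side.
    """
    browser = {k.lower(): v for k, v in browser_headers.items()}
    script = {k.lower(): v for k, v in script_headers.items()}
    differences = {}
    for name in sorted(set(browser) | set(script)):
        if browser.get(name) != script.get(name):
            differences[name] = (browser.get(name), script.get(name))
    return differences


USER_AGENT = ("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 "
              "(KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2")

# Taken from the "send:" line of the debug output in the question.
script_sent = {
    "Accept-Encoding": "identity",
    "Host": "xyz.com",
    "Connection": "close",
    "User-Agent": USER_AGENT,
}

# Placeholder values for illustration only; replace with a real browser capture.
browser_sent = {
    "Host": "xyz.com",
    "Connection": "keep-alive",
    "User-Agent": USER_AGENT,
    "Accept": "text/html",
    "Accept-Encoding": "gzip, deflate",
    "Referer": "http://xyz.com/",
}

for name, (in_browser, in_script) in diff_headers(browser_sent, script_sent).items():
    print("%s: browser=%r script=%r" % (name, in_browser, in_script))
```

Each reported difference is a candidate to try in the script, one at a time, until the page returns content.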