Python Mechanize not handling redirects correctly
I'm working on a scraper using Mechanize and Beautiful Soup in Python and for some reason redirects aren't working. Here's my code (I apologize for naming my variables "thing" and "stuff"; I don't normally do that, trust me):
stuff = soup.find('div', attrs={'class' : 'paging'}).ul.findAll('a', href=True)
for thing in stuff:
    pageUrl = thing['href']
    print pageUrl
    req = mechanize.Request(pageUrl)
    response = browser.open(req)
    searchPage = response.read()
    soup = BeautifulSoup(searchPage)
    soupString = soup.prettify()
    print soupString
Anyway, products on Kraft's website that have more than one page of search results display a link to go to the next page(s). The source code lists, for example, this as the next page for Kraft's line of steak sauces and marinades, which redirects to this
Anyway, thing['href'] has the old link in it because it scrapes the web page for it; one would think that doing browser.open() on that link would cause mechanize to go to the new link and return that as a response. However, running the code gives this result:
http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2
Traceback (most recent call last):
File "C:\Development\eclipse\mobile development\Crawler\src\Kraft.py", line 58, in <module>
response = browser.open(req)
File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 203, in open
File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 255, in _mech_open
mechanize._response.httperror_seek_wrapper: HTTP Error 408: Request Time-out
I get a time-out; I imagine it's because, for some reason, mechanize is looking for the old URL and isn't being redirected to the new one (I also tried this with urllib2 and received the same result). What's going on here?
Thanks for the help and let me know if you need any more information.
Update: Alright, I enabled logging; now my code reads:
req = mechanize.Request(pageUrl)
print logging.INFO
When I run it I get this:
url argument is not a URI (contains illegal characters) u'http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2'
20
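(As an aside, print logging.INFO just prints the numeric value of the INFO level constant, which is where that trailing 20 comes from. To actually capture a library's log output, you attach a handler to its logger. A minimal sketch with the standard logging module; the "mechanize" logger name is the one mechanize documents, but treat it as an assumption here:)

```python
import logging
import sys

# logging.INFO is just the integer 20, which explains the stray "20"
# printed in the output above
print(logging.INFO)

# To see a library's log messages, configure its logger instead of
# printing the level constant; "mechanize" is the logger name the
# mechanize docs describe (adjust the name for other libraries)
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.DEBUG)
```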
Update 2 (which occurred while writing the first update): It turns out that it was the spaces in my string! All I had to do was this: pageUrl = thing['href'].replace(' ', "+")
and it works perfectly.
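(For reference: replacing spaces by hand works here, but the general fix is to percent-encode any illegal characters before making the request. A minimal sketch using the standard library; it's shown with Python 3's urllib.parse, while in the Python 2 used above the same function lives in urllib as urllib.quote:)

```python
from urllib.parse import quote

# The URL scraped from the page, with literal spaces in the query string
pageUrl = ("http://www.kraftrecipes.com/products/pages/"
           "productinfosearchresults.aspx?catalogtype=1&brandid=1"
           "&searchtext=a.1. steak sauces and marinades&pageno=2")

# quote() escapes anything not listed in `safe`; keeping the URL
# delimiters in `safe` means only the spaces get encoded here
safeUrl = quote(pageUrl, safe=":/?&=")
print(safeUrl)
```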
Both urllib2 and mechanize openers include a handler for redirect responses by default (you can check by looking at the handlers attribute), so I don't think the problem is that a redirect response isn't being followed correctly.
To troubleshoot the problem, you should capture the traffic in your web browser (in Firefox, Live HTTP Headers and HttpFox are useful for this) and compare it with the logs from your script (I'd recommend subclassing urllib2.BaseHandler to create your own handler that logs all the information you need for every request, and adding that handler to your opener object with the add_handler method).
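The subclassing idea above can be sketched as follows, using Python 3's urllib.request (where urllib2's machinery now lives); the class and attribute names are illustrative, not part of any library API:

```python
import urllib.request

# The default opener already includes HTTPRedirectHandler, which is why
# redirects are normally followed without any extra setup
default = urllib.request.build_opener()
print([type(h).__name__ for h in default.handlers])

class LoggingHandler(urllib.request.BaseHandler):
    """Record every outgoing request so it can be compared against a
    browser capture (class and attribute names are illustrative)."""
    handler_order = 100  # run before the protocol handlers

    def __init__(self):
        self.seen = []

    def http_request(self, req):
        # *_request methods act as request pre-processors in the opener
        self.seen.append((req.get_method(), req.full_url))
        return req

    https_request = http_request

log_handler = LoggingHandler()
opener = urllib.request.build_opener(log_handler)
# opener.open(url) would now append one entry to log_handler.seen
# per outgoing request, including any redirected follow-up requests
```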