Python Mechanize not handling redirects correctly

Posted on 2024-12-20 03:16:13

I'm working on a scraper using Mechanize and Beautiful Soup in Python and for some reason redirects aren't working. Here's my code (I apologize for naming my variables "thing" and "stuff"; I don't normally do that, trust me):

stuff = soup.find('div', attrs={'class': 'paging'}).ul.findAll('a', href=True)
for thing in stuff:
    pageUrl = thing['href']
    print pageUrl

    req = mechanize.Request(pageUrl)
    response = browser.open(req)
    searchPage = response.read()

    soup = BeautifulSoup(searchPage)
    soupString = soup.prettify()
    print soupString

Anyway, products on Kraft's website that have more than one page of search results display a link to go to the next page(s). The source code lists, for example, this as the next page for Kraft's line of steak sauces and marinades, which redirects to this

Anyway, thing['href'] has the old link in it because it scrapes the web page for it; one would think that doing browser.open() on that link would cause mechanize to go to the new link and return that as a response. However, running the code gives this result:

http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2
Traceback (most recent call last):
File "C:\Development\eclipse\mobile development\Crawler\src\Kraft.py", line 58, in <module>
response = browser.open(req)
File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 203, in open
File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 255, in _mech_open
mechanize._response.httperror_seek_wrapper: HTTP Error 408: Request Time-out

I get a time-out; I imagine it's because, for some reason, mechanize is looking for the old URL and isn't being redirected to the new one (I also tried this with urllib2 and received the same result). What's going on here?

Thanks for the help and let me know if you need any more information.

Update: Alright, I enabled logging; now my code reads:

req = mechanize.Request(pageUrl)
print logging.INFO

When I run it I get this:

url argument is not a URI (contains illegal characters) u'http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2'
20
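(For what it's worth, `print logging.INFO` only prints the constant `20`, which is where that trailing `20` comes from; it doesn't enable logging by itself.) mechanize routes its debug output through the standard `logging` module, so actually enabling it usually looks something like the following sketch (Python 3 syntax; the original code is Python 2):

```python
import logging
import sys

# Route mechanize's internal debug messages to stdout.
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.DEBUG)

# For reference: logging.INFO is just the numeric level constant.
print(logging.INFO)  # 20
```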

Update 2 (which occurred while writing the first update): It turns out that it was the spaces in my string! All I had to do was this: pageUrl = thing['href'].replace(' ', "+") and it works perfectly.
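The `replace(' ', "+")` fix works because `+` is a legal encoding for a space inside a query string. A more general approach is to percent-encode whatever illegal characters a scraped URL contains. A sketch using Python 3's `urllib.parse` (in the Python 2 code above, `urllib.quote` plays the same role):

```python
from urllib.parse import quote, urlsplit, urlunsplit

def sanitize_url(url):
    """Percent-encode illegal characters (such as spaces) in the path
    and query of a scraped URL, leaving safe characters untouched."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path, safe="/"),
        quote(parts.query, safe="=&"),  # keep the key=value&... structure
        parts.fragment,
    ))

raw = ("http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx"
       "?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2")
print(sanitize_url(raw))
```

This encodes the spaces as `%20` while leaving the `=` and `&` separators intact, so the URL is accepted without hand-tuning which characters to replace.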

Comments (1)

萤火眠眠 2024-12-27 03:16:13

Both urllib2 and mechanize openers include a handler for redirect responses by default (you can check by looking at the handlers attribute), so I don't think the problem is that a redirect response isn't being correctly followed.

To troubleshoot the problem, you should capture the traffic in your web browser (in Firefox, Live HTTP Headers and HttpFox are useful for this) and compare it with the logs from your script (I'd recommend subclassing urllib2.BaseHandler to create your own handler that logs all the information you need for every request, and adding it to your opener object with the add_handler method).
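The urllib2.BaseHandler API mentioned above maps to urllib.request.BaseHandler in Python 3. A minimal request-logging handler along those lines might look like this (a sketch; the handler name and print format are my own):

```python
import urllib.request

class LoggingHandler(urllib.request.BaseHandler):
    """Print every outgoing request's method, URL, and headers."""
    handler_order = 100  # low value so this runs before the default handlers

    def http_request(self, req):
        # Pre-processor hook: called for every HTTP request before it is sent.
        print(">>", req.get_method(), req.full_url)
        for name, value in req.header_items():
            print("   ", name + ":", value)
        return req  # must return the request for the chain to continue

    https_request = http_request  # log HTTPS requests the same way

opener = urllib.request.build_opener(LoggingHandler())
```

With this opener in place of the default one, every request (including the redirected ones) shows up in the script's output and can be lined up against the browser capture.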
