Avoiding redirection

Posted 2024-12-06 08:39:15 · Word count 1287 · Views 0 · Comments 0

I'm trying to parse a site (written in ASP), and the crawler gets redirected to the main site. What I'd like to do is parse the given URL, not the one it redirects to. Is there a way to do this? I tried adding "REDIRECT=False" to the settings.py file, without success.

Here's some output from the crawler:

2011-09-24 20:01:11-0300 [coto] DEBUG: Redirecting (302) to <GET http://www.cotodigital.com.ar/default.asp> from <GET http://www.cotodigital.com.ar/l.asp?cat=500&id=500>
2011-09-24 20:01:11-0300 [coto] DEBUG: Redirecting (302) to <GET http://www.cotodigital.com.ar/default.asp> from <GET http://www.cotodigital.com.ar/l.asp?cat=1513&id=1513>
2011-09-24 20:01:11-0300 [coto] DEBUG: Redirecting (302) to <GET http://www.cotodigital.com.ar/default.asp> from <GET http://www.cotodigital.com.ar/l.asp?cat=476&id=476>
2011-09-24 20:01:11-0300 [coto] DEBUG: Redirecting (302) to <GET http://www.cotodigital.com.ar/default.asp> from <GET http://www.cotodigital.com.ar/l.asp?cat=472&id=472>
2011-09-24 20:01:11-0300 [coto] DEBUG: Redirecting (302) to <GET http://www.cotodigital.com.ar/default.asp> from <GET http://www.cotodigital.com.ar/l.asp?cat=457&id=457>
2011-09-24 20:01:11-0300 [coto] DEBUG: Redirecting (302) to <GET http://www.cotodigital.com.ar/default.asp> from <GET http://www.cotodigital.com.ar/l.asp?cat=1097&id=1097>


Comments (2)

满地尘埃落定 2024-12-13 08:39:15

http://www.cotodigital.com.ar/l.asp?cat=1097&id=1097 redirects to http://www.cotodigital.com.ar/default.asp because the HTTP response says so. This happens because the ASP code is checking some condition: a wrong page, cookies, the user agent, or the referrer. Check each of these.

UPDATE:
Just checked in my browser: the browser is also redirected to the main page, where I clicked 'Skip ads'. After that it works OK.

This means it sets some cookies, without which it redirects to the main page.

See also Scrapy - how to manage cookies/sessions
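
The cookie explanation above can be reproduced with a small standard-library sketch, independent of Scrapy. Everything in it is an assumption made for illustration: the FakeSite handler is a hypothetical stand-in for the real site, and the seen_ads cookie name is invented, since the real site's cookie names are not given in this thread. It shows the same URL bouncing to /default.asp without a cookie jar, and being served directly once the main page has been visited with one.

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class FakeSite(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the real site: every page except
    /default.asp redirects there unless a 'seen_ads' cookie is sent."""

    def do_GET(self):
        if self.path == "/default.asp":
            body = b"main page"
            self.send_response(200)
            self.send_header("Set-Cookie", "seen_ads=1; Path=/")
        elif "seen_ads=1" in self.headers.get("Cookie", ""):
            body = b"category listing"
            self.send_response(200)
        else:
            body = b""
            self.send_response(302)
            self.send_header("Location", "/default.asp")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(opener, url):
    """Return (final URL, body) after urllib has followed any redirects."""
    with opener.open(url) as resp:
        return resp.geturl(), resp.read().decode()


server = HTTPServer(("127.0.0.1", 0), FakeSite)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"
target = base + "/l.asp?cat=1097&id=1097"

no_proxy = urllib.request.ProxyHandler({})  # talk to the local server directly

# Without a cookie jar, the target URL bounces to the main page.
plain = urllib.request.build_opener(no_proxy)
no_cookie_url, no_cookie_body = fetch(plain, target)

# With a cookie jar, visiting the main page first stores the cookie,
# and the same target URL is then served without a redirect.
session = urllib.request.build_opener(
    no_proxy, urllib.request.HTTPCookieProcessor(CookieJar())
)
fetch(session, base + "/default.asp")
with_cookie_url, with_cookie_body = fetch(session, target)

server.shutdown()
server.server_close()
```

In Scrapy, cookies are kept across requests by default (the COOKIES_ENABLED setting), so requesting the main page first and yielding the category requests from its callback should have a similar effect, assuming the 'Skip ads' step does nothing more than set cookies.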

倾`听者〃 2024-12-13 08:39:15

The original URL has nothing to scrape. It returned a 302, meaning the response itself carries no content worth parsing, and the Location header indicates where to redirect to. You need to figure out how to access the URL without being redirected, perhaps by authenticating.
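
For reference, a minimal settings.py sketch of what turning redirects off looks like in Scrapy. The setting Scrapy reads is REDIRECT_ENABLED; a name like REDIRECT is not one of its settings, which would explain why the attempt in the question had no effect. As this answer points out, disabling redirects only hands the spider the 302 itself, so this is useful for inspecting the redirect rather than for getting the page content.

```python
# settings.py (fragment)

# Scrapy's switch for its redirect middleware. Setting it to False
# means 3xx responses are no longer followed automatically.
REDIRECT_ENABLED = False

# Non-2xx responses are normally filtered out before they reach spider
# callbacks; listing 302 here lets the redirect response through.
HTTPERROR_ALLOWED_CODES = [302]
```

With these in place, a callback can read response.status and response.headers.get("Location") to confirm where the server is sending the crawler. The same behaviour can be requested for a single request through the dont_redirect and handle_httpstatus_list keys of Request.meta, without changing project-wide settings.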
