Problem scraping Yahoo Groups with Scrapy

Posted on 2024-10-15 17:28:18


I'm new to web scraping and just started experimenting with Scrapy, a scraping framework written in Python. My goal is to scrape an old Yahoo Group since they don't provide an API or any other means to retrieve message archives. The Yahoo Group is set such that you have to log in before you can view the archives.

The steps I need to accomplish, I think, are:

1. Log in to Yahoo
2. Visit the URL for the first message and scrape it
3. Repeat step 2 for the next message, and so on

I started roughing out a Scrapy spider to accomplish the above, and here is what I have so far. All I want to confirm for now is that the login works and that I can retrieve the first message. I'll finish the rest once I get this much working:

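from scrapy import FormRequest, Request, Spider

# URLs inferred from the log output below; the credentials are placeholders.
LOGIN_URL = 'https://login.yahoo.com/config/login'
MSG_URL = 'http://launch.groups.yahoo.com/group/MyYahooGroup/message/%d'
LOGIN = 'my-yahoo-id'
PASSWORD = 'my-password'


class Sg101Spider(Spider):  # Spider was named BaseSpider in older Scrapy releases
    name = "sg101"
    msg_id = 1              # current message to retrieve
    max_msg_id = 21399      # last message to retrieve

    def start_requests(self):
        # Log in first by POSTing the credentials to the login form.
        return [FormRequest(LOGIN_URL,
            formdata={'login': LOGIN, 'passwd': PASSWORD},
            callback=self.logged_in)]

    def logged_in(self, response):
        # A successful login ends up on my.yahoo.com after the redirects.
        if response.url == 'http://my.yahoo.com':
            self.log("Successfully logged in. Now requesting 1st message.")
            return Request(MSG_URL % self.msg_id, callback=self.parse_msg,
                    errback=self.error)
        else:
            self.log("Login failed.")

    def parse_msg(self, response):
        # Never reached in the run shown below.
        self.log("Got message!")
        print(response.body)

    def error(self, failure):
        # Errback for download failures on the message request.
        self.log("I haz an error")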

When I run the spider, though, I see it log in and issue the request for the first message. However, all I see in Scrapy's debug output is three redirects, eventually arriving at the URL I asked for in the first place. Scrapy never calls my parse_msg() callback, and the crawl stops. Here is a snippet of the Scrapy output:

2011-02-03 19:50:10-0600 [sg101] INFO: Spider opened
2011-02-03 19:50:10-0600 [sg101] DEBUG: Redirecting (302) to <GET https://login.yahoo.com/config/verify?.done=http%3a//my.yahoo.com> from <POST https://login.yahoo.com/config/login>
2011-02-03 19:50:10-0600 [sg101] DEBUG: Redirecting (meta refresh) to <GET http://my.yahoo.com> from <GET https://login.yahoo.com/config/verify?.done=http%3a//my.yahoo.com>
2011-02-03 19:50:12-0600 [sg101] DEBUG: Crawled (200) <GET http://my.yahoo.com> (referer: None)
2011-02-03 19:50:12-0600 [sg101] DEBUG: Successfully logged in. Now requesting 1st message.
2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1>
2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
2011-02-03 19:50:13-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
2011-02-03 19:50:13-0600 [sg101] INFO: Closing spider (finished)
2011-02-03 19:50:13-0600 [sg101] INFO: Spider closed (finished)

I am unable to make sense of this. It looks like Yahoo is redirecting the spider (maybe for an auth check?), but it ends up back at the URL I wanted to visit in the first place. Scrapy still doesn't call my callback, so I don't get a chance to scrape the data or continue crawling.
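To double-check that reading of the log, here is a small standard-library sketch that parses the three "Redirecting (302)" lines for the message request and follows them from the message URL. The regular expression and the helper name `follow_redirects` are illustrative assumptions, not part of Scrapy.

```python
import re

# The three redirect lines for the message request, copied from the log above.
LOG_LINES = [
    "2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> "
    "from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1>",
    "2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> "
    "from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>",
    "2011-02-03 19:50:13-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1> "
    "from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>",
]

# Hypothetical pattern for this log format: captures (target, source).
REDIRECT_RE = re.compile(r"Redirecting \(\d+\) to <GET (\S+)> from <GET (\S+)>")


def follow_redirects(start_url, log_lines):
    """Return the URLs visited when following the logged redirects from start_url."""
    hops = {}
    for line in log_lines:
        match = REDIRECT_RE.search(line)
        if match:
            target, source = match.groups()
            hops[source] = target
    chain = [start_url]
    # Take at most one step per logged redirect so a cycle cannot loop forever.
    for _ in range(len(hops)):
        next_url = hops.get(chain[-1])
        if next_url is None:
            break
        chain.append(next_url)
    return chain


START = "http://launch.groups.yahoo.com/group/MyYahooGroup/message/1"
chain = follow_redirects(START, LOG_LINES)
print(len(chain) - 1, "redirects; back at start:", chain[-1] == START)
# prints: 3 redirects; back at start: True
```

So the last redirect really does target the exact URL that was requested first, which is a request the crawler has already seen once.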

Does anyone have any ideas on what is happening and/or how to debug this further? Thanks!

治碍 2024-10-22 17:28:18

I think Yahoo is redirecting for an authorization check, and it finally redirects me back to the page I really wanted to get. Scrapy has already seen this request, however, and stops because it doesn't want to get into a loop. The solution, in my case, is to add dont_filter=True to the Request constructor. This will instruct Scrapy to not filter out duplicate requests. This is fine in my case, because I know in advance what URLs I want to crawl.

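import logging  # stdlib levels; older Scrapy releases exposed the same names via scrapy.log

# Replacement for the logged_in method of Sg101Spider.
def logged_in(self, response):
    if response.url == 'http://my.yahoo.com':
        self.log("Successfully logged in. Now requesting message page.",
                level=logging.INFO)
        # dont_filter=True lets the final redirect back to this URL
        # through the duplicate filter.
        return Request(MSG_URL % self.msg_id, callback=self.parse_msg,
                errback=self.error, dont_filter=True)
    else:
        self.log("Login failed.", level=logging.CRITICAL)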
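The behaviour described above can be illustrated with a simplified, standard-library model of a duplicate filter. This is only a sketch: `ToyRequest`, `ToyScheduler` and `crawl` are hypothetical names and not Scrapy's implementation, which fingerprints requests rather than comparing raw URLs. It assumes that redirected requests pass through the same filter as the original request and inherit its dont_filter flag, and that the auth redirects fire only once. The URLs are shortened stand-ins for the ones in the log.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyRequest:
    url: str
    dont_filter: bool = False


class ToyScheduler:
    """Drops requests whose URL was already seen, unless dont_filter is set."""

    def __init__(self):
        self.seen = set()

    def enqueue(self, request):
        if request.dont_filter:
            return True  # bypass the duplicate filter entirely
        if request.url in self.seen:
            return False  # duplicate: silently dropped
        self.seen.add(request.url)
        return True


def crawl(first_request, redirects):
    """Return the URL finally downloaded, or None if a hop was filtered out."""
    scheduler = ToyScheduler()
    pending = dict(redirects)  # each redirect fires once, like a one-time auth check
    request = first_request
    while scheduler.enqueue(request):
        target = pending.pop(request.url, None)
        if target is None:
            return request.url  # normal response: the callback would run here
        # The redirected request keeps the flags of the request it replaces.
        request = replace(request, url=target)
    return None  # dropped as a duplicate: no callback, the crawl just ends


# Shortened stand-ins for the message URL and the two auth URLs in the log.
MSG, AUTH, CHECK = "/message/1", "/auth?done=...", "/auth?check=G&done=..."
REDIRECTS = {MSG: AUTH, AUTH: CHECK, CHECK: MSG}

print(crawl(ToyRequest(MSG), REDIRECTS))                    # None
print(crawl(ToyRequest(MSG, dont_filter=True), REDIRECTS))  # /message/1
```

In this model the unfiltered run ends without ever reaching a callback, which matches the log in the question: three redirects followed directly by "Closing spider (finished)".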
