Recursive crawling with Python and Scrapy

I'm using Scrapy to crawl a site. The site has 15 listings per page and then a next button. I am running into an issue where my Request for the next link is being called before I have finished parsing all of my listings in the pipeline. Here is the code for my spider:

import urlparse

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
# MySiteLoader is the project's ItemLoader subclass; its import path is omitted here.


class MySpider(CrawlSpider):
    name = 'mysite.com'
    allowed_domains = ['mysite.com']
    start_url = 'http://www.mysite.com/'

    def start_requests(self):
        return [Request(self.start_url, callback=self.parse_listings)]

    def parse_listings(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select('...')

        for listing in listings:
            il = MySiteLoader(selector=listing)
            il.add_xpath('Title', '...')
            il.add_xpath('Link', '...')

            item = il.load_item()
            listing_url = listing.select('...').extract()

            if listing_url:
                yield Request(urlparse.urljoin(response.url, listing_url[0]),
                              meta={'item': item},
                              callback=self.parse_listing_details)

        next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                                   'div[@class="next-link"]/a/@href').extract()
        if next_page_url:
            yield Request(urlparse.urljoin(response.url, next_page_url[0]),
                          callback=self.parse_listings)


    def parse_listing_details(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.request.meta['item']
        details = hxs.select('...')
        il = MySiteLoader(selector=details, item=item)

        il.add_xpath('Posted_on_Date', '...')
        il.add_xpath('Description', '...')
        return il.load_item()

These lines are the problem. As I said before, they are being executed before the spider has finished crawling the current page. On every page of the site, this causes only 3 out of 15 of my listings to be sent to the pipeline.

        if next_page_url:
            yield Request(urlparse.urljoin(response.url, next_page_url[0]),
                          callback=self.parse_listings)

This is my first spider, so this might be a design flaw on my part. Is there a better way to do this?

7 Answers

残月升风 2024-10-27 12:57:38

Scrape instead of spider?

Because your original problem requires the repeated navigation of a consecutive and repeated set of content instead of a tree of content of unknown size, use mechanize (http://wwwsearch.sourceforge.net/mechanize/) and beautifulsoup (http://www.crummy.com/software/BeautifulSoup/).

Using br.follow_link(text="foo") means that, unlike the xpath in your example, the link will still be followed no matter how the elements in its ancestor path are structured; it is the xpath approach that breaks as soon as they update their HTML. The looser coupling will save you some maintenance. Here is an example of instantiating a browser with mechanize:

import cookielib
import mechanize

br = mechanize.Browser()
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# a single assignment, so the three headers don't overwrite one another
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1'),
                 ('Accept-Language', 'en-US'),
                 ('Accept-Encoding', 'gzip, deflate')]
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.open("http://amazon.com")
br.follow_link(text="Today's Deals")
print br.response().read()

Also, in the "next 15" href there is probably something indicating pagination e.g. &index=15. If the total number of items on all pages is available on the first page, then:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with bs4 use "from bs4 import BeautifulSoup"

soup = BeautifulSoup(br.response().read())
totalItems = soup.findAll(id="results-count-total")[0].text
startVar =  [x for x in range(int(totalItems)) if x % 15 == 0]

Then just iterate over startVar and create the url, adding the value of startVar to the url, br.open() it and scrape the data. That way you don't have to programmatically "find" the "next" link on the page and execute a click on it to advance to the next page - you already know all the valid urls. Minimizing code-driven manipulation of the page to only the data you need will speed up your extraction.
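A minimal sketch of that loop, continuing from the mechanize and BeautifulSoup snippets above. The URL pattern with an index query parameter and the listing markup are illustrative assumptions, not taken from a real site:

# Sketch only: startVar comes from the snippet above; the URL pattern and the
# "listing" class are hypothetical placeholders for the real site's layout.
base_url = "http://www.mysite.com/listings?index=%d"

rows = []
for start in startVar:
    br.open(base_url % start)                # the URL is known up front; no "next" link to find
    page = BeautifulSoup(br.response().read())
    for listing in page.findAll("div", {"class": "listing"}):
        rows.append(listing.text)            # collect whatever data you need per listing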

回眸一笑 2024-10-27 12:57:38

有两种方法可以按顺序执行此操作:

  1. 通过在类下定义 listing_url 列表。
  2. 通过在 parse_listings() 中定义 listing_url

唯一的区别是措辞。另外,假设有五个页面需要获取 listing_urls。因此,也将 page=1 放在类下。

parse_listings 方法中,仅发出一次请求。将所有数据放入您需要跟踪的meta 中。也就是说,仅使用 parse_listings 来解析“首页”。

到达队列末尾后,归还您的物品。这个过程是连续的。

class MySpider(CrawlSpider):
    name = 'mysite.com'
    allowed_domains = ['mysite.com']
    start_url = 'http://www.mysite.com/'

    listing_url = []
    page = 1

    def start_requests(self):
        return [Request(self.start_url, meta={'page': page}, callback=self.parse_listings)]

    def parse_listings(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select('...')

        for listing in listings:
            il = MySiteLoader(selector=listing)
            il.add_xpath('Title', '...')
            il.add_xpath('Link', '...')

        items = il.load_item()

        # populate the listing_url with the scraped URLs
        self.listing_url.extend(listing.select('...').extract())

        next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                                   'div[@class="next-link"]/a/@href').extract()

        # now that the front page is done, move on to the next listing_url.pop(0)
        # add the next_page_url to the meta data
        return Request(urlparse.urljoin(response.url, self.listing_url.pop(0)),
                            meta={'page': self.page, 'items': items, 'next_page_url': next_page_url},
                            callback=self.parse_listing_details)

    def parse_listing_details(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.request.meta['item']
        details = hxs.select('...')
        il = MySiteLoader(selector=details, item=item)

        il.add_xpath('Posted_on_Date', '...')
        il.add_xpath('Description', '...')
        items = il.load_item()

        # check to see if you have any more listing_urls to parse and last page
        if self.listing_urls:
            return Request(urlparse.urljoin(response.url, self.listing_urls.pop(0)),
                            meta={'page': self.page, 'items': items, 'next_page_url': response.meta['next_page_url']},
                            callback=self.parse_listings_details)
        elif not self.listing_urls and response.meta['page'] != 5:
            # loop back for more URLs to crawl
            return Request(urlparse.urljoin(response.url, response.meta['next_page_url']),
                            meta={'page': self.page + 1, 'items': items},
                            callback=self.parse_listings)
        else:
            # reached the end of the pages to crawl, return data
            return il.load_item()

There are two ways of doing this sequentially:

  1. by defining a listing_url list under the class.
  2. by defining the listing_url inside the parse_listings().

The only difference is verbiage. Also, suppose there are five pages of listing_urls to get. So put page=1 under the class as well.

In the parse_listings method, make only one request. Put all the data you need to keep track of into the meta. That said, use parse_listings only to parse the 'front page'.

Once you have reached the end of the line, return your items. This process is sequential.

class MySpider(CrawlSpider):
    name = 'mysite.com'
    allowed_domains = ['mysite.com']
    start_url = 'http://www.mysite.com/'

    listing_url = []
    page = 1

    def start_requests(self):
        return [Request(self.start_url, meta={'page': self.page}, callback=self.parse_listings)]

    def parse_listings(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select('...')

        for listing in listings:
            il = MySiteLoader(selector=listing)
            il.add_xpath('Title', '...')
            il.add_xpath('Link', '...')

        items = il.load_item()

        # populate the listing_url with the scraped URLs
        self.listing_url.extend(listing.select('...').extract())

        next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                                   'div[@class="next-link"]/a/@href').extract()

        # now that the front page is done, move on to the next listing_url.pop(0)
        # add the next_page_url to the meta data
        return Request(urlparse.urljoin(response.url, self.listing_url.pop(0)),
                            meta={'page': self.page, 'items': items, 'next_page_url': next_page_url},
                            callback=self.parse_listing_details)

    def parse_listing_details(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['items']
        details = hxs.select('...')
        il = MySiteLoader(selector=details, item=item)

        il.add_xpath('Posted_on_Date', '...')
        il.add_xpath('Description', '...')
        items = il.load_item()

        # check to see if you have any more listing_urls to parse and last page
        if self.listing_url:
            return Request(urlparse.urljoin(response.url, self.listing_url.pop(0)),
                            meta={'page': self.page, 'items': items, 'next_page_url': response.meta['next_page_url']},
                            callback=self.parse_listing_details)
        elif not self.listing_url and response.meta['page'] != 5:
            # loop back for more URLs to crawl
            return Request(urlparse.urljoin(response.url, response.meta['next_page_url']),
                            meta={'page': self.page + 1, 'items': items},
                            callback=self.parse_listings)
        else:
            # reached the end of the pages to crawl, return data
            return il.load_item()

却一份温柔 2024-10-27 12:57:38

You can yield requests or items as many times as you need.

def parse_category(self, response):
    hxs = HtmlXPathSelector(response)

    # Get links to other categories
    categories = hxs.select('.../@href').extract()

    # First, return the CategoryItem
    # (l is an item loader built from this response; its setup is omitted here)
    yield l.load_item()

    for url in categories:
        # Then yield a request to parse each category
        yield Request(url, self.parse_category)

I found that here — https://groups.google.com/d/msg/scrapy-users/tHAAgnuIPR4/0ImtdyIoZKYJ

空心↖ 2024-10-27 12:57:38

See below for an updated answer, under the EDIT 2 section (updated October 6th, 2017)

Is there any specific reason that you're using yield? Yield will return a generator, which will return the Request object when .next() is invoked on it.

Change your yield statements to return statements and things should work as expected.

Here's an example of a generator:

In [1]: def foo(request):
   ...:     yield 1
   ...:     
   ...:     

In [2]: print foo(None)
<generator object foo at 0x10151c960>

In [3]: foo(None).next()
Out[3]: 1

EDIT:

Change your def start_requests(self) function to use the follow parameter.

return [Request(self.start_url, callback=self.parse_listings, follow=True)]

EDIT 2:

As of Scrapy v1.4.0, released on 2017-05-18, it is now recommended to use response.follow instead of creating scrapy.Request objects directly.

From the release notes:

There’s a new response.follow method for creating requests; it is now
a recommended way to create Requests in Scrapy spiders. This method
makes it easier to write correct spiders; response.follow has several
advantages over creating scrapy.Request objects directly:

  • it handles relative URLs;
  • it works properly with non-ascii URLs on non-UTF8 pages;
  • in addition to absolute and relative URLs it supports Selectors; for <a> elements it can also extract their href values.

So, for the OP above, change the code from:

    next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                               'div[@class="next-link"]/a/@href').extract()
    if next_page_url:
        yield Request(urlparse.urljoin(response.url, next_page_url[0]),
                      callback=self.parse_listings)

to:

    next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                               'div[@class="next-link"]/a/@href')
    if next_page_url is not None:
        yield response.follow(next_page_url, self.parse_listings)

冷情 2024-10-27 12:57:38

You might want to look into two things.

  1. The website you are crawling may be blocking the user agent you have defined.
  2. Try adding a DOWNLOAD_DELAY to your spider.
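
Both of those are standard Scrapy settings. A minimal sketch of what they could look like in the project's settings.py (the values here are arbitrary examples, not recommendations):

# settings.py (sketch)
DOWNLOAD_DELAY = 2   # wait a couple of seconds between requests
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) Gecko/20100101 Firefox/9.0.1'  # browser-like UA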

静赏你的温柔 2024-10-27 12:57:38

I just fixed this same problem in my code. I used the SQLite3 database that comes with Python 2.7 to fix it: each item you are collecting info about gets its own row in a database table on the first pass of the parse function, and each instance of the parse callback adds that item's data to its row in the table. Keep an instance counter so that the last callback parse routine knows it is the last one and writes the CSV file from the database, or whatever you need. The callback can be recursive, being told in meta which parse schema (and of course which item) it was dispatched to work with. Works like a charm for me. You have SQLite3 if you have Python. Here was my post when I first discovered Scrapy's limitation in this regard:
Is Scrapy's asynchronicity what is hindering my CSV results file from being created straightforwardly?
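
A very rough sketch of the bookkeeping described above, with an illustrative table layout (the table name, columns and helper functions are assumptions, not the poster's actual code):

import sqlite3

conn = sqlite3.connect('items.db')
conn.execute('CREATE TABLE IF NOT EXISTS listings '
             '(url TEXT PRIMARY KEY, title TEXT, posted_on TEXT, description TEXT)')

def record_listing(url, title):
    # First pass: one row per listing discovered on an index page.
    conn.execute('INSERT OR IGNORE INTO listings (url, title) VALUES (?, ?)', (url, title))
    conn.commit()

def record_details(url, posted_on, description):
    # Detail callback: fill in the remaining columns of that listing's row.
    conn.execute('UPDATE listings SET posted_on = ?, description = ? WHERE url = ?',
                 (posted_on, description, url))
    conn.commit()

# Once a counter (or Scrapy's spider_closed signal) indicates the last callback
# has run, dump the table to CSV in whatever order you like.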

意中人 2024-10-27 12:57:38

http://autopython.blogspot.com/2014/04/recursive-scraping-using-different.html

This example shows how to scrape multiple "next" pages from a website using different techniques.
