Recursive crawling with Python and Scrapy
I'm using scrapy to crawl a site. The site has 15 listings per page and then has a next button. I am running into an issue where my Request for the next link is being called before I have finished parsing all of my listings in the pipeline. Here is the code for my spider:
import urlparse

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

# MySiteLoader is the project's ItemLoader subclass (its definition is not shown here)


class MySpider(CrawlSpider):
    name = 'mysite.com'
    allowed_domains = ['mysite.com']
    start_url = 'http://www.mysite.com/'

    def start_requests(self):
        return [Request(self.start_url, callback=self.parse_listings)]

    def parse_listings(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select('...')

        for listing in listings:
            il = MySiteLoader(selector=listing)
            il.add_xpath('Title', '...')
            il.add_xpath('Link', '...')

            item = il.load_item()
            listing_url = listing.select('...').extract()

            if listing_url:
                # follow the detail page, carrying the partially built item in meta
                yield Request(urlparse.urljoin(response.url, listing_url[0]),
                              meta={'item': item},
                              callback=self.parse_listing_details)

        # queue the next listings page
        next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                                   'div[@class="next-link"]/a/@href').extract()
        if next_page_url:
            yield Request(urlparse.urljoin(response.url, next_page_url[0]),
                          callback=self.parse_listings)

    def parse_listing_details(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.request.meta['item']
        details = hxs.select('...')
        il = MySiteLoader(selector=details, item=item)

        il.add_xpath('Posted_on_Date', '...')
        il.add_xpath('Description', '...')
        return il.load_item()
These lines are the problem. Like I said before, they are being executed before the spider has finished crawling the current page. On every page of the site, this causes only 3 out of 15 of my listings to be sent to the pipeline.
if next_page_url:
    yield Request(urlparse.urljoin(response.url, next_page_url[0]),
                  callback=self.parse_listings)
This is my first spider and this might be a design flaw on my part; is there a better way to do this?
Comments (7)
Scrape instead of spider?
Because your original problem requires the repeated navigation of a consecutive and repeated set of content instead of a tree of content of unknown size, use mechanize (http://wwwsearch.sourceforge.net/mechanize/) and BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/).

Here's an example of instantiating a browser using mechanize. Also, using br.follow_link(text="foo") means that, unlike the xpath in your example, the link will still be followed no matter the structure of the elements in its ancestor path. Meaning, an xpath tied to the page structure breaks if they update their HTML, whereas following by link text does not. A looser coupling will save you some maintenance. Here is an example:
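A minimal sketch of what that could look like (the URL, headers, and link text below are placeholders, not the OP's real site):

import mechanize
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; use "from bs4 import BeautifulSoup" for v4

br = mechanize.Browser()
br.set_handle_robots(False)                      # mechanize obeys robots.txt unless told otherwise
br.addheaders = [('User-agent', 'Mozilla/5.0')]  # some sites block the default UA

br.open('http://www.mysite.com/')                # placeholder URL
soup = BeautifulSoup(br.response().read())
# ... pull the 15 listings out of `soup` here ...

# Follow the pagination link by its text, not by its position in the DOM:
br.follow_link(text='Next')
soup = BeautifulSoup(br.response().read())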
Also, in the "next 15" href there is probably something indicating pagination, e.g. &index=15. If the total number of items on all pages is available on the first page, then you can compute every page offset up front.

Then just iterate over startVar and create the url, add the value of startVar to the url, br.open() it and scrape the data. That way you don't have to programmatically "find" the "next" link on the page and execute a click on it to advance to the next page - you already know all the valid urls. Minimizing code-driven manipulation of the page to only the data you need will speed up your extraction.
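A rough sketch of that idea, assuming a hypothetical &index=N query parameter and that the total item count was scraped from the first page:

import mechanize
from BeautifulSoup import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)

page_size = 15
total_items = 120                                # hypothetical count read from page one
startVars = range(0, total_items, page_size)     # [0, 15, 30, ...]

for startVar in startVars:
    # hypothetical URL pattern; substitute the site's real pagination parameter
    br.open('http://www.mysite.com/listings?index=%d' % startVar)
    soup = BeautifulSoup(br.response().read())
    # ... scrape the 15 listings on this page from `soup` ...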
There are two ways of doing this sequentially:

1. by defining a listing_url list under the class;
2. by defining the listing_url inside parse_listings().

The only difference is verbiage. Also, suppose there are five pages of listing_urls to get, so put page=1 under the class as well.

In the parse_listings method, only make a request once. Put all the data that you need to keep track of into meta. That being said, use parse_listings only to parse the 'front page'.

Once you reach the end of the line, return your items. This process is sequential.
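A rough sketch of that flow, as I read it: chain the detail requests one at a time through meta and only hand the items back once the current page is finished. The spider name, XPaths, and item fields below are placeholders, not the OP's real selectors.

import urlparse

from scrapy.http import Request
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class ListingItem(Item):
    Title = Field()
    Description = Field()


class SequentialSpider(BaseSpider):
    name = 'mysite.com.sequential'
    allowed_domains = ['mysite.com']
    start_urls = ['http://www.mysite.com/']
    page = 1  # current listings page, kept under the class as suggested above

    def parse(self, response):
        # Only the 'front page' (the listings page) is parsed here.
        hxs = HtmlXPathSelector(response)
        listing_urls = [urlparse.urljoin(response.url, href)
                        for href in hxs.select('...').extract()]
        next_page = hxs.select('descendant::div[@id="pagination"]/'
                               'div[@class="next-link"]/a/@href').extract()
        return self.request_next_detail(response, listing_urls, next_page, items=[])

    def request_next_detail(self, response, listing_urls, next_page, items):
        if listing_urls:
            # One detail request at a time; everything we need rides along in
            # meta. dont_filter=True so a repeated URL cannot break the chain.
            return Request(listing_urls.pop(0),
                           meta={'listing_urls': listing_urls,
                                 'next_page': next_page,
                                 'items': items},
                           callback=self.parse_details,
                           dont_filter=True)
        if next_page:
            # All 15 details for this page are done: hand back this page's
            # items together with the request for the next listings page.
            self.page += 1
            return items + [Request(urlparse.urljoin(response.url, next_page[0]),
                                    callback=self.parse)]
        # End of the line: return whatever is left to the pipeline.
        return items

    def parse_details(self, response):
        meta = response.request.meta
        hxs = HtmlXPathSelector(response)
        item = ListingItem(Title=hxs.select('...').extract(),        # placeholder XPaths
                           Description=hxs.select('...').extract())
        meta['items'].append(item)
        return self.request_next_detail(response, meta['listing_urls'],
                                        meta['next_page'], meta['items'])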
You can yield requests or items as many times as you need.
I found that here — https://groups.google.com/d/msg/scrapy-users/tHAAgnuIPR4/0ImtdyIoZKYJ
See below for an updated answer, under the EDIT 2 section (updated October 6th, 2017)
Is there any specific reason that you're using yield? Yield will return a generator, which will return the Request object when .next() is invoked on it.

Change your yield statements to return statements and things should work as expected.

Here's an example of a generator:
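A small, self-contained illustration of that behaviour (not Scrapy-specific):

def squares(n):
    for i in range(n):
        yield i * i

gen = squares(4)        # nothing has executed yet; gen is a generator object
print gen.next()        # 0  - the body only runs when .next() is called
print gen.next()        # 1
print list(gen)         # [4, 9] - the remaining values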
EDIT:

Change your def start_requests(self) function to use the follow parameter.
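The answer does not show which follow parameter it means; assuming it refers to the follow flag on CrawlSpider rules, a sketch could look like this (the link-extractor XPath is a placeholder):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):
    name = 'mysite.com'
    allowed_domains = ['mysite.com']
    start_urls = ['http://www.mysite.com/']

    rules = (
        # follow=True tells the CrawlSpider to keep extracting and following
        # "next" links from every page this rule matches.
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="next-link"]',)),
             callback='parse_listings', follow=True),
    )

    def parse_listings(self, response):
        # parse the 15 listings on this page here
        pass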
EDIT 2:

As of Scrapy v1.4.0, released on 2017-05-18, it is now recommended to use response.follow instead of creating scrapy.Request objects directly. From the release notes:
So, for the OP above, change the next-page request from the urlparse.urljoin(...) Request shown in the question to response.follow:
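A sketch of that change, reusing the pagination XPath from the question (response.follow accepts the relative href directly, so the urljoin call is no longer needed):

def parse_listings(self, response):
    # ... item extraction as in the question ...

    # before:
    #     yield Request(urlparse.urljoin(response.url, next_page_url[0]),
    #                   callback=self.parse_listings)

    # after (Scrapy >= 1.4):
    next_page_url = response.xpath('descendant::div[@id="pagination"]/'
                                   'div[@class="next-link"]/a/@href').extract_first()
    if next_page_url:
        yield response.follow(next_page_url, callback=self.parse_listings)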
You might want to look into two things.
I just fixed this same problem in my code. I used the SQLite3 database that comes as part of Python 2.7 to fix it: each item you are collecting info about gets its unique line put into a database table in the first pass of the parse function, and each instance of the parse callback adds each item's data to the table and line for that item. Keep an instance counter so that the last callback parse routine knows that it is the last one, and writes the CSV file from the database or whatever. The callback can be recursive, being told in meta which parse schema (and of course which item) it was dispatched to work with. Works for me like a charm. You have SQLite3 if you have Python. Here was my post when I first discovered scrapy's limitation in this regard:
Is Scrapy's asynchronicity what is hindering my CSV results file from being created straightforwardly?
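A hedged sketch of the same idea, moved into an item pipeline for brevity (close_spider stands in for the instance counter the answer describes); the table and field names are invented for illustration:

import csv
import sqlite3


class SQLiteBufferPipeline(object):
    """Park every item in SQLite as it arrives, then dump a CSV at the end."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect('items.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS listings '
                          '(link TEXT PRIMARY KEY, title TEXT, description TEXT)')

    def process_item(self, item, spider):
        # INSERT OR REPLACE lets a later callback fill in more columns for the
        # same listing's row.
        self.conn.execute('INSERT OR REPLACE INTO listings VALUES (?, ?, ?)',
                          (item.get('Link'), item.get('Title'),
                           item.get('Description')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        with open('listings.csv', 'wb') as f:    # 'wb' for the Python 2 csv module
            writer = csv.writer(f)
            writer.writerow(['link', 'title', 'description'])
            writer.writerows(self.conn.execute('SELECT * FROM listings'))
        self.conn.close()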
http://autopython.blogspot.com/2014/04/recursive-scraping-using-different.html
This example shows how to scrape multiple next pages from a website using different techniques.