Scraping landing pages from a list of domains
I have a reasonably long list of websites that I want to download the landing (index.html or equivalent) pages for. I am currently using Scrapy (much love to the guys behind it -- this is a fabulous framework). Scrapy is slower on this particular task than I'd like, and I am wondering if wget or another alternative would be faster given how straightforward the task is. Any ideas?

(Here's what I am doing with Scrapy. Anything I can do to optimize Scrapy for this task?)
So, I have a list of start URLs like

start_urls = ["google.com", "yahoo.com", "aol.com"]
And I scrape the text from each response and store it in XML. I need to turn off the OffsiteMiddleware to allow multiple domains.
Scrapy works as expected, but seems slow (about 1000 in an hour, or one every 4 seconds). Is there a way to speed this up by increasing the number of CONCURRENT_REQUESTS_PER_SPIDER while running a single spider? Anything else?
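For reference, a minimal sketch of the kind of spider described above, written against current Scrapy (the CONCURRENT_REQUESTS_PER_SPIDER setting from older releases has since been replaced by CONCURRENT_REQUESTS / CONCURRENT_REQUESTS_PER_DOMAIN); the class name, settings values, and yielded fields are illustrative:

```python
import scrapy

class LandingPageSpider(scrapy.Spider):
    name = "landing_pages"
    # Leaving allowed_domains unset means the offsite filter does not
    # restrict requests, so any domain in start_urls can be fetched.
    start_urls = [
        "http://google.com",
        "http://yahoo.com",
        "http://aol.com",
    ]

    custom_settings = {
        # Raise overall concurrency; the defaults are fairly conservative.
        "CONCURRENT_REQUESTS": 100,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
        # With one request per host, slow DNS lookups and retries dominate,
        # so a shorter timeout and no retries also help throughput.
        "DOWNLOAD_TIMEOUT": 15,
        "RETRY_ENABLED": False,
    }

    def parse(self, response):
        # Yield the page body keyed by URL; the XML serialization the
        # question mentions would then happen in an item pipeline.
        yield {"url": response.url, "body": response.text}
```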
If you want a way to concurrently download multiple sites with Python, you can do so with the standard library like this:
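(The code block that originally followed this sentence was not preserved; below is a minimal sketch of the idea, assuming a thread pool from concurrent.futures plus urllib.request. The fetch_page helper, pool size, and timeout are illustrative choices, not the answer's original code.)

```python
# Sketch: concurrently fetch landing pages using only the standard library.
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

def fetch_page(url):
    # Return (url, decoded body) for a single landing page.
    with urlopen(url, timeout=30) as resp:
        return url, resp.read().decode("utf-8", errors="replace")

urls = ["http://google.com", "http://yahoo.com", "http://aol.com"]

with ThreadPoolExecutor(max_workers=20) as pool:
    futures = {pool.submit(fetch_page, u): u for u in urls}
    for fut in as_completed(futures):
        try:
            url, body = fut.result()
            print(url, len(body))
        except Exception as exc:
            print(futures[fut], "failed:", exc)
```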
You could also check out httplib2 or PycURL to do the downloading for you instead of urllib. I'm not clear exactly how you want the scraped text as XML to look, but you could use xml.etree.ElementTree from the standard library, or you could install BeautifulSoup (which would be better as it handles malformed markup).
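As a concrete example of the ElementTree route, here is a sketch that writes one element per fetched page; the pages_to_xml helper and the <pages>/<page> element names are made up for illustration:

```python
# Sketch: serialize scraped pages as XML with the standard library.
import xml.etree.ElementTree as ET

def pages_to_xml(pages, path):
    """pages: iterable of (url, text) pairs; writes them to an XML file."""
    root = ET.Element("pages")
    for url, text in pages:
        page = ET.SubElement(root, "page", url=url)
        page.text = text
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

pages_to_xml([("http://example.com", "Example Domain")], "landing_pages.xml")
```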