Scraping the landing pages of a list of domains

Posted on 2024-08-26 09:56:50

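I have a fairly long list of websites for which I want to download the landing (index.html or equivalent) pages. I am currently using Scrapy (much love to the people behind it; it is a fabulous framework). Scrapy is slower on this particular task than I would like, and I am wondering whether wget or another alternative would be faster, given how straightforward the task is. Any ideas?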

(Here is what I am doing with Scrapy. Is there anything I can do to optimize Scrapy for this task?)

So, I have a list of start URLs like

start_urls=[google.com yahoo.com aol.com]

I scrape the text from each response and store it in an XML file. I need to turn off the OffsiteMiddleware to allow for multiple domains.

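Scrapy works as expected, but it seems slow (about 1000 pages an hour, or 1 every 4 seconds). Is there a way to speed this up by increasing CONCURRENT_REQUESTS_PER_SPIDER while running a single spider? Anything else?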


琉璃繁缕 2024-09-02 09:56:53

If you want a way to download multiple sites concurrently with Python, you can do so with the standard library like this:

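import threading
import urllib.request

maxthreads = 4

sites = ['google.com', 'yahoo.com', ]  # etc.

class Download(threading.Thread):
    def run(self):
        # Each worker keeps taking sites off the shared list until it is empty.
        while True:
            try:
                site = sites.pop()
            except IndexError:
                break  # another thread took the last site
            print("start", site)
            try:
                # Save the landing page to a local file named after the site.
                urllib.request.urlretrieve('http://' + site, site)
            except OSError as exc:
                # One unreachable site should not stop this worker.
                print("failed", site, exc)
                continue
            print("end  ", site)

for x in range(min(maxthreads, len(sites))):
    Download().start()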
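
If you would rather not manage the threads yourself, the same idea can be sketched with a thread pool from the standard library. This is only a rough sketch, assuming Python 3; the names `fetch_landing_page` and `download_all` are made up for illustration, and the timeout and worker count are arbitrary choices rather than tuned values.

```python
from concurrent.futures import ThreadPoolExecutor
import http.client
import urllib.request


def fetch_landing_page(site, timeout=10):
    """Return the raw bytes of http://<site>, or None if the request fails."""
    try:
        with urllib.request.urlopen('http://' + site, timeout=timeout) as response:
            return response.read()
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts all derive from OSError.
        return None


def download_all(sites, fetch=fetch_landing_page, max_workers=4):
    """Fetch every site concurrently and return a {site: result} dict."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input list.
        return dict(zip(sites, pool.map(fetch, sites)))
```

Calling `download_all(['google.com', 'yahoo.com'])` would return a dict mapping each site to its page bytes, or to None for any site that could not be fetched. Since the work is almost entirely waiting on the network, raising `max_workers` well above 4 is usually the first thing to try.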

You could also check out httplib2 or PycURL to do the downloading for you instead of urllib.

I'm not clear exactly how you want the scraped text as xml to look, but you could use xml.etree.ElementTree from the standard library or you could install BeautifulSoup (which would be better as it handles malformed markup).
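
As a rough illustration of the ElementTree route, here is a minimal sketch, assuming Python 3. The layout is a guess, since the question does not say what the XML should look like: the element names `pages` and `page`, the `site` attribute, and the helper `pages_to_xml` are all invented for this example.

```python
import xml.etree.ElementTree as ET


def pages_to_xml(pages):
    """Build a <pages> element with one <page> child per site."""
    root = ET.Element('pages')
    for site, text in pages.items():
        page = ET.SubElement(root, 'page', {'site': site})
        page.text = text  # special characters are escaped on serialization
    return root


# Placeholder text standing in for whatever was scraped from each site.
root = pages_to_xml({'google.com': 'Search & more', 'yahoo.com': 'News <today>'})
xml_string = ET.tostring(root, encoding='unicode')
print(xml_string)
```

To write the result to disk instead of printing it, `ET.ElementTree(root).write('pages.xml', encoding='utf-8', xml_declaration=True)` should do it.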
