What is the best way to scrape multiple domains with scrapy?
I have around 10 odd sites that I wish to scrape from. A couple of them are WordPress blogs and they follow the same HTML structure, albeit with different classes. The others are either forums or blogs of other formats.
The information I'd like to scrape is common to all of them: the post content, the timestamp, the author, the title and the comments.
My question is, do I have to create one separate spider for each domain? If not, how can I create a generic spider that lets me scrape by loading options from a configuration file or something similar?
I figured I could load the XPath expressions from a file whose location can be passed via the command line, but there seem to be some difficulties, since scraping some domains requires that I use a regex, select(expression_here).re(regex), while others do not.
6 Answers
In your scrapy spider, set allowed_domains to a list of domains, for example:
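The example code seems to have been lost from this answer; a minimal sketch of the idea, with made-up spider and domain names, might look like this:

    import scrapy

    class MultiSiteSpider(scrapy.Spider):
        name = "multi_site"
        # One spider can cover several sites: list every domain you want to
        # crawl here and Scrapy's offsite middleware will drop anything else.
        allowed_domains = ["example-blog.com", "example-forum.net"]
        start_urls = [
            "http://example-blog.com/",
            "http://example-forum.net/",
        ]

        def parse(self, response):
            # Extraction logic goes here; the point of this sketch is the
            # allowed_domains list above.
            yield {"url": response.url}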
hope it helps
Well, I faced the same issue, so I created the spider class dynamically using type(). So, say, to create a spider for 'http://www.google.com' I'd just do something along these lines -
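The original snippet did not survive the copy; a rough reconstruction of the idea (the make_spider helper and its arguments are my own naming, not the answerer's) could be:

    import scrapy

    def make_spider(name, domain, start_url):
        # Build a Spider subclass at runtime instead of writing one per site.
        return type(
            name,
            (scrapy.Spider,),
            {
                "name": name,
                "allowed_domains": [domain],
                "start_urls": [start_url],
                # Replace this stub with real per-site parsing logic.
                "parse": lambda self, response: None,
            },
        )

    # e.g. for 'http://www.google.com'
    GoogleSpider = make_spider("google", "www.google.com", "http://www.google.com")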
Hope this helps
I do sort of the same thing, using the following XPath expressions: '/html/head/title/text()' for the title and //p[string-length(text()) > 150]/text() for the post content.
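For illustration, those two expressions could be dropped into a plain spider along these lines (the spider name and URL are placeholders):

    import scrapy

    class GenericPostSpider(scrapy.Spider):
        name = "generic_post"
        start_urls = ["http://example-blog.com/some-post"]  # placeholder URL

        def parse(self, response):
            yield {
                # The page <title> usually carries the post title.
                "title": response.xpath("/html/head/title/text()").get(),
                # Paragraphs longer than 150 characters are usually body text
                # rather than navigation or boilerplate.
                "content": response.xpath(
                    "//p[string-length(text()) > 150]/text()"
                ).getall(),
            }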
You can use an empty allowed_domains attribute to instruct scrapy not to filter any offsite requests. But in that case you must be careful to return only relevant requests from your spider.
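As a sketch of what that means in practice (the domains are placeholders, and the manual check stands in for whatever relevance filter you need):

    import scrapy

    class UnfilteredSpider(scrapy.Spider):
        name = "unfiltered"
        allowed_domains = []  # empty: the offsite middleware filters nothing
        start_urls = ["http://example-blog.com/"]  # placeholder

        def parse(self, response):
            for href in response.xpath("//a/@href").getall():
                url = response.urljoin(href)
                # Nothing is filtered automatically, so keep only the links
                # you actually care about before yielding a request.
                if "example-blog.com" in url or "example-forum.net" in url:
                    yield scrapy.Request(url, callback=self.parse)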
You should use BeautifulSoup especially if you're using Python. It enables you to find elements in the page, and extract text using regular expressions.
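A toy example of that combination (the HTML, class name and regex are invented for illustration):

    import re
    from bs4 import BeautifulSoup

    html = (
        "<html><head><title>A post</title></head>"
        "<body><p class='meta'>Posted on 2013-05-01 by alice</p></body></html>"
    )
    soup = BeautifulSoup(html, "html.parser")

    # Locate elements by tag (or class), then pull details out with a regex.
    title = soup.title.get_text()
    meta_text = soup.find("p", class_="meta").get_text()
    match = re.search(r"Posted on (\S+) by (\w+)", meta_text)
    if match:
        timestamp, author = match.groups()
        print(title, timestamp, author)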
You can use the start_requests method! Then you can prioritize each URL as well, and on top of that you can pass some meta data. Here's a sample code that works:
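The answer's original sample did not survive the copy; a small sketch of the idea (the URLs, priorities and the site_type key are placeholders of my own):

    import scrapy

    class ConfiguredSpider(scrapy.Spider):
        name = "configured"

        # Per-site settings; in practice these could come from a config file.
        sites = [
            {"url": "http://example-blog.com/", "priority": 10, "site_type": "wordpress"},
            {"url": "http://example-forum.net/", "priority": 5, "site_type": "forum"},
        ]

        def start_requests(self):
            for site in self.sites:
                yield scrapy.Request(
                    site["url"],
                    callback=self.parse,
                    priority=site["priority"],  # higher priority is scheduled first
                    meta={"site_type": site["site_type"]},  # travels to the callback
                )

        def parse(self, response):
            # Read the metadata back and branch on it if needed.
            yield {"url": response.url, "site_type": response.meta["site_type"]}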
I recommend you read this page for more info on scrapy:
https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.start_requests
Hope this helps :)