What is the best way to scrape multiple domains with Scrapy?

Posted on 2024-10-29 00:28:42


I have around 10-odd sites that I wish to scrape. A couple of them are WordPress blogs and they follow the same HTML structure, albeit with different classes. The others are forums or blogs in other formats.

The information I'd like to scrape is common to all of them: the post content, the timestamp, the author, the title and the comments.

My question is: do I have to create a separate spider for each domain? If not, how can I create a generic spider that lets me scrape by loading options from a configuration file or something similar?

I figured I could load the XPath expressions from a file whose location is passed via the command line, but there seems to be a difficulty: scraping some domains requires that I use a regex, select(expression_here).re(regex), while others do not.
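
To make the idea concrete, here is a rough sketch of the kind of config-driven spider I have in mind (the JSON layout, the -a config=... argument and the field names are just assumptions, not something I have working):

import json

import scrapy


class ConfigurableSpider(scrapy.Spider):
    """Sketch: a generic spider whose XPaths come from a per-site JSON config."""
    name = 'configurable'

    def __init__(self, config=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Run as: scrapy crawl configurable -a config=sites/wordpress.json
        with open(config) as f:
            self.site = json.load(f)
        self.start_urls = self.site['start_urls']
        self.allowed_domains = self.site.get('allowed_domains', [])

    def parse(self, response):
        item = {}
        for field, spec in self.site['fields'].items():
            sel = response.xpath(spec['xpath'])
            # Some sites need a regex applied on top of the XPath, some don't.
            item[field] = sel.re_first(spec['re']) if 're' in spec else sel.get()
        yield item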


み零 2024-11-05 00:28:42


In your Scrapy spider, set allowed_domains to a list of domains, for example:

from scrapy.spiders import CrawlSpider

class YourSpider(CrawlSpider):
    allowed_domains = ['domain1.com', 'domain2.com']

Hope it helps.

只是我以为 2024-11-05 00:28:42


Well, I faced the same issue, so I created the spider class dynamically using type():

from scrapy.spiders import CrawlSpider   # scrapy.contrib.spiders on old Scrapy versions
from urllib.parse import urlparse        # the urlparse module on Python 2

class GenericSpider(CrawlSpider):
    """A generic spider; uses type() to make a new spider class for each domain."""
    name = 'generic'
    allowed_domains = []
    start_urls = []

    @classmethod
    def create(cls, link):
        domain = urlparse(link).netloc.lower()
        # Generate a class name such that www.google.com becomes GoogleComGenericSpider.
        class_name = (domain if not domain.startswith('www.') else domain[4:]).title().replace('.', '') + cls.__name__
        return type(class_name, (cls,), {
            'allowed_domains': [domain],
            'start_urls': [link],
            'name': domain,
        })

So, say, to create a spider for 'http://www.google.com' I'd just do:

In [3]: google_spider = GenericSpider.create('http://www.google.com')

In [4]: google_spider
Out[4]: __main__.GoogleComGenericSpider

In [5]: google_spider.name
Out[5]: 'www.google.com'

Hope this helps
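
For completeness, a minimal sketch of running a couple of these generated classes in one process (the link list and the settings dict are just assumptions):

from scrapy.crawler import CrawlerProcess

links = ['http://www.google.com', 'http://www.example.com']
process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
for link in links:
    # Each call registers a freshly generated spider class for one domain.
    process.crawl(GenericSpider.create(link))
process.start()  # blocks until every registered spider has finished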

为你鎻心 2024-11-05 00:28:42


I do sort of the same thing using the following XPath expressions:

  • /html/head/title/text() for the title
  • //p[string-length(text()) > 150]/text() for the post content.
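
As a rough sketch (the item keys are just placeholders), those two expressions could be wired into a parse callback like this:

def parse(self, response):
    yield {
        # Page title from the <head>.
        'title': response.xpath('/html/head/title/text()').get(),
        # Heuristic: long <p> elements are usually the post body.
        'content': response.xpath('//p[string-length(text()) > 150]/text()').getall(),
    }
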
§对你不离不弃 2024-11-05 00:28:42


You can use an empty allowed_domains attribute to tell Scrapy not to filter any off-site requests. But in that case you must be careful to return only relevant requests from your spider.
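
A minimal sketch of that approach, assuming a simple substring check stands in for whatever relevance test you actually need:

import scrapy


class AnySiteSpider(scrapy.Spider):
    # No allowed_domains, so the offsite middleware filters nothing.
    name = 'anysite'
    start_urls = ['http://example.com/']

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            # Nothing is filtered for us, so only follow links we actually want.
            if 'blog' in href or 'forum' in href:
                yield response.follow(href, callback=self.parse)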

卷耳 2024-11-05 00:28:42


You should use BeautifulSoup, especially if you're using Python. It lets you find elements in the page and extract text using regular expressions.
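
For example, a small sketch of that combination (the URL, the CSS class and the regex are made up):

import re

import requests
from bs4 import BeautifulSoup

html = requests.get('http://example.com/post/1').text
soup = BeautifulSoup(html, 'html.parser')

title = soup.find('title').get_text()
# Find an element by tag/class, then post-process its text with a regex.
date_tag = soup.find('span', class_='post-date')  # hypothetical class name
match = re.search(r'\d{4}-\d{2}-\d{2}', date_tag.get_text()) if date_tag else None
timestamp = match.group() if match else None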

最偏执的依靠 2024-11-05 00:28:42


You can use the start_requests method!

Then you can also prioritize each URL, and on top of that you can pass some metadata along with it.

Here's a sample code that works:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

"""
For allowed_domains:
Let's say your target url is https://www.example.com/1.html,
then add 'example.com' to the list.
"""
class crawler(CrawlSpider):
    name = "crawler_name"

    # parse_urls() is your own helper that returns the allowed domains
    # and the list of URLs to scrape (see the sketch below).
    allowed_domains, urls_to_scrape = parse_urls()
    rules = [
        Rule(LinkExtractor(allow=['.*']),
             callback='parse_item',
             follow=True)
    ]

    def start_requests(self):
        for i, url in enumerate(self.urls_to_scrape):
            yield scrapy.Request(url=url.strip(), callback=self.parse_item,
                                 priority=i + 1, meta={'pass_any_data_here': 1})

    def parse_item(self, response):
        # Replace 'logic' with your actual CSS selector.
        data = response.css('logic')

        # Whatever you passed in meta comes back on response.meta.
        yield {'link': str(response.url),
               'extracted_data': data.getall(),
               'meta_data': response.meta.get('pass_any_data_here')}
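
The parse_urls() helper above isn't shown; a plausible sketch (the file name and format are assumptions, and it would need to be defined or imported before the class body runs) could be:

from urllib.parse import urlparse

def parse_urls(path='urls.txt'):
    """Read start URLs from a text file and derive the allowed domains from them."""
    with open(path) as f:
        urls = [line.strip() for line in f if line.strip()]
    netlocs = [urlparse(u).netloc.lower() for u in urls]
    # 'www.example.com' -> 'example.com', as the docstring above suggests.
    domains = [n[4:] if n.startswith('www.') else n for n in netlocs]
    return domains, urls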

I recommend reading this page of the Scrapy docs for more info:

https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.start_requests

Hope this helps :)
