Can scrapy be used to scrape dynamic content from websites that are using AJAX?
I have recently been learning Python and am dipping my hand into building a web-scraper. It's nothing fancy at all; its only purpose is to get the data off of a betting website and have this data put into Excel.
Most of the issues are solvable and I'm having a good little mess around. However, I'm hitting a massive hurdle over one issue. If a site loads a table of horses and lists current betting prices, this information is not in any source file. The clue is that this data is live sometimes, with the numbers being updated obviously from some remote server. The HTML on my PC simply has a hole where their servers are pushing through all the interesting data that I need.
Now my experience with dynamic web content is low, so this thing is something I'm having trouble getting my head around.
I think Java or Javascript is key; this pops up often.
The scraper is simply an odds comparison engine. Some sites have APIs, but I need this for those that don't. I'm using the scrapy library with Python 2.7.
I do apologize if this question is too open-ended. In short, my question is: how can scrapy be used to scrape this dynamic data so that I can use it? So that I can scrape this betting odds data in real-time?
See also: How can I scrape a page with dynamic content (created by JavaScript) in Python? for the general case.
Here is a simple example of scrapy with an AJAX request. Let's look at the site rubin-kazan.ru. All messages are loaded with an AJAX request. My goal is to fetch these messages with all their attributes (author, date, ...):
When I analyze the source code of the page I can't see all these messages, because the web page uses AJAX technology. But I can use Firebug from Mozilla Firefox (or an equivalent tool in other browsers) to analyze the HTTP requests that generate the messages on the web page:
It doesn't reload the whole page, but only the parts of the page that contain messages. For this purpose I click an arbitrary page number at the bottom:
And I observe the HTTP request that is responsible for the message body:
After that, I analyze the headers of the request (note that I'll extract this URL from the source page's var section, see the code below):
And the form data content of the request (the HTTP method is "Post"):
And the content of the response, which is a JSON file:
Which presents all the information I'm looking for.
From now on, I must implement all this knowledge in scrapy. Let's define the spider for this purpose:
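A minimal sketch of such a spider, replaying the AJAX POST request directly; the endpoint URL, form fields, and JSON field names below are illustrative assumptions rather than the site's exact ones:

```python
import json

import scrapy


class RubiGuessItem(scrapy.Item):
    # Fields chosen for illustration; the real item would mirror the JSON.
    author = scrapy.Field()
    date = scrapy.Field()
    text = scrapy.Field()


class RubinSpider(scrapy.Spider):
    name = "rubin"
    start_urls = ["http://www.rubin-kazan.ru/guestbook/"]

    def parse(self, response):
        # In practice the AJAX URL is extracted from a "var" section of the
        # page source; it is hard-coded here only to keep the sketch short.
        ajax_url = "http://www.rubin-kazan.ru/guestbook/messages/"
        yield scrapy.FormRequest(
            ajax_url,
            formdata={"page": "1"},
            callback=self.parse_messages,
        )

    def parse_messages(self, response):
        data = json.loads(response.text)
        for msg in data.get("messages", []):
            item = RubiGuessItem()
            item["author"] = msg.get("author")
            item["date"] = msg.get("date")
            item["text"] = msg.get("text")
            yield item
```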
In the parse function I have the response for the first request. In RubiGuessItem I have the JSON file with all the information.
Webkit-based browsers (like Google Chrome or Safari) have built-in developer tools. In Chrome you can open them via Menu -> Tools -> Developer Tools. The Network tab allows you to see all information about every request and response. At the bottom of the picture you can see that I've filtered requests down to XHR - these are requests made by javascript code.

Tip: the log is cleared every time you load a page; the black dot button at the bottom of the picture will preserve the log.
After analyzing requests and responses you can simulate these requests from your web-crawler and extract valuable data. In many cases it will be easier to get your data than parsing HTML, because that data does not contain presentation logic and is formatted to be accessed by javascript code.
Firefox has a similar extension, called Firebug. Some will argue that Firebug is even more powerful, but I like the simplicity of webkit.
Many times when crawling we run into problems where content that is rendered on the page is generated with Javascript and therefore scrapy is unable to crawl for it (eg. ajax requests, jQuery craziness).
However, if you use Scrapy along with the web testing framework Selenium then we are able to crawl anything displayed in a normal web browser.
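A rough sketch of such a template crawler. The original snippet used the older Selenium RC API; this sketch swaps in the modern Selenium WebDriver API instead, and the start URL and selectors are placeholders:

```python
import scrapy
from selenium import webdriver


class SeleniumSpider(scrapy.Spider):
    name = "selenium_spider"
    start_urls = ["http://example.com/page-with-js"]  # placeholder URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Scrapy has already downloaded the page once; Selenium now fetches
        # it a second time so the JavaScript gets executed (hence the note
        # below about two requests per URL).
        self.driver.get(response.url)
        rendered = scrapy.Selector(text=self.driver.page_source)
        for row in rendered.css("table.odds tr"):  # placeholder selector
            yield {"cells": row.css("td::text").getall()}

    def closed(self, reason):
        self.driver.quit()
```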
Some things to note:
You must have the Python version of Selenium RC installed for this to work, and you must have set up Selenium properly. Also this is just a template crawler. You could get much crazier and more advanced with things but I just wanted to show the basic idea. As the code stands now you will be doing two requests for any given url. One request is made by Scrapy and the other is made by Selenium. I am sure there are ways around this so that you could possibly just make Selenium do the one and only request but I did not bother to implement that and by doing two requests you get to crawl the page with Scrapy too.
This is quite powerful because now you have the entire rendered DOM available for you to crawl and you can still use all the nice crawling features in Scrapy. This will make for slower crawling of course but depending on how much you need the rendered DOM it might be worth the wait.
Reference: http://snipplr.com/view/66998/
Another solution would be to implement a download handler or download handler middleware (see the scrapy docs for more information on downloader middleware). The following is an example class using selenium with the headless phantomjs webdriver:

1) Define the class within the middlewares.py script.

2) Add the JsDownload() class to the DOWNLOADER_MIDDLEWARES variable within settings.py.

3) Integrate the HTMLResponse within your_spider.py. Decoding the response body will get you the desired output.
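A sketch of what steps 1) and 2) could look like. The JsDownload name follows the text above, while the PhantomJS call, the module path, and the priority value are assumptions (PhantomJS support also requires an older Selenium release):

```python
# middlewares.py
from scrapy.http import HtmlResponse
from selenium import webdriver


class JsDownload(object):
    """Downloader middleware that renders each page with headless PhantomJS."""

    def process_request(self, request, spider):
        driver = webdriver.PhantomJS()
        try:
            driver.get(request.url)
            body = driver.page_source.encode("utf-8")
        finally:
            driver.quit()
        # Returning a Response here short-circuits the normal download,
        # so the spider receives the already-rendered HTML.
        return HtmlResponse(request.url, body=body, encoding="utf-8", request=request)
```

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    "your_project.middlewares.JsDownload": 543,  # module path and priority are placeholders
}
```

For step 3), the spider then receives that HTMLResponse like any other response, so the rendered markup is available through response.body or the usual selectors.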
Optional addon: I wanted the ability to tell different spiders which middleware to use, so I implemented this wrapper. For the wrapper to work, every spider must declare a middleware set at minimum, and a spider opts in to a middleware by including its class in that set; a sketch of both pieces follows below.
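One way such an opt-in wrapper could look, with the decorator applied to JsDownload.process_request; the decorator name and attribute convention are assumptions:

```python
import functools


def check_spider_middleware(method):
    """Run the wrapped middleware method only for spiders that opted in."""
    @functools.wraps(method)
    def wrapper(self, request, spider):
        if self.__class__ in getattr(spider, "middleware", set()):
            return method(self, request, spider)
        return None  # fall through to the normal downloader
    return wrapper


# Minimum every spider needs for the wrapper to work:
#     middleware = set()
#
# A spider that wants JS rendering includes the middleware class:
#     middleware = {your_project.middlewares.JsDownload}
```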
Advantage:
The main advantage to implementing it this way rather than in the spider is that you only end up making one request. In A T's solution, for example, the download handler processes the request and then hands off the response to the spider. The spider then makes a brand new request in its parse_page function -- that's two requests for the same content.
I was using a custom downloader middleware, but wasn't very happy with it, as I didn't manage to make the cache work with it.
A better approach was to implement a custom download handler.
There is a working example here. It looks like this:
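A rough sketch of what such a download handler can look like, using selenium with PhantomJS; the class name and details are assumptions, and the linked working example should be preferred for the full implementation:

```python
# handlers.py
from scrapy.http import HtmlResponse
from selenium import webdriver
from twisted.internet import threads


class PhantomJSDownloadHandler(object):
    """Download handler that renders every request with headless PhantomJS."""

    def __init__(self, settings):
        self.settings = settings

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def download_request(self, request, spider):
        # Run the blocking selenium work in a thread pool so the Twisted
        # reactor keeps serving other requests.
        return threads.deferToThread(self._render, request)

    def _render(self, request):
        driver = webdriver.PhantomJS()
        try:
            driver.get(request.url)
            body = driver.page_source.encode("utf-8")
        finally:
            driver.quit()
        return HtmlResponse(request.url, body=body, encoding="utf-8", request=request)
```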
Suppose your scraper is called "scraper". If you put the mentioned code inside a file called handlers.py on the root of the "scraper" folder, then you could add to your settings.py:
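The corresponding settings.py entry could look roughly like this, assuming the handler class is named PhantomJSDownloadHandler as in the sketch above (the name is a guess, not the one from the linked example):

```python
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scraper.handlers.PhantomJSDownloadHandler",
    "https": "scraper.handlers.PhantomJSDownloadHandler",
}
```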
And voilà, the JS parsed DOM, with scrapy cache, retries, etc.
I wonder why no one has posted the solution using Scrapy only.
Check out the blog post from the Scrapy team, SCRAPING INFINITE SCROLLING PAGES. The example scrapes the http://spidyquotes.herokuapp.com/scroll website, which uses infinite scrolling.
The idea is to use Developer Tools of your browser and notice the AJAX requests, then based on that information create the requests for Scrapy.
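A sketch of that idea for the infinite-scroll demo; the JSON endpoint and field names are assumptions about what the page's AJAX requests look like, so verify them in the Network tab first:

```python
import json

import scrapy


class ScrollSpider(scrapy.Spider):
    name = "scroll"
    api_url = "http://spidyquotes.herokuapp.com/api/quotes?page={}"
    start_urls = [api_url.format(1)]

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data.get("quotes", []):
            yield {
                "text": quote.get("text"),
                "author": quote.get("author", {}).get("name"),
            }
        # Keep requesting pages while the API reports more data.
        if data.get("has_next"):
            next_page = data.get("page", 0) + 1
            yield scrapy.Request(self.api_url.format(next_page))
```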
The data is generated from an external URL, an API that returns the HTML response to a POST request.
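A small sketch of that idea, calling an external API with a POST request and parsing the HTML it returns; the endpoint and form fields are invented for illustration:

```python
import scrapy


class ApiPostSpider(scrapy.Spider):
    name = "api_post"

    def start_requests(self):
        yield scrapy.FormRequest(
            "https://example.com/api/odds",   # hypothetical API endpoint
            formdata={"market": "horses"},    # hypothetical form data
            callback=self.parse_api,
        )

    def parse_api(self, response):
        # The API answers with an HTML fragment, so normal selectors apply.
        for row in response.css("tr"):
            yield {"cells": row.css("td::text").getall()}
```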
There are a few more modern alternatives in 2022 that I think should be mentioned, and I would like to list some pros and cons for the methods discussed in the more popular answers to this question.
The top answer and several others discuss using the browser's dev tools or packet-capturing software to try to identify patterns in the response URLs, and re-construct them to use as scrapy.Requests.

Pros: This is still the best option in my opinion, and when it is available it is quick and often simpler than even the traditional approach, i.e. extracting content from the HTML using xpath and css selectors.

Cons: Unfortunately this is only available on a fraction of dynamic sites, and frequently websites have security measures in place that make using this strategy difficult.
Using Selenium Webdriver is the other approach mentioned a lot in previous answers.

Pros: It's easy to implement and integrate into the scrapy workflow. Additionally, there are a ton of examples, and it requires very little configuration if you use 3rd-party extensions like scrapy-selenium.

Cons: It's slow! One of scrapy's key features is its asynchronous workflow, which makes it easy to crawl dozens or even hundreds of pages in seconds. Using selenium cuts this down significantly.
There are two new methods that are definitely worth consideration: scrapy-splash and scrapy-playwright.

scrapy-splash: installed from pypi with pip3 install scrapy-splash, while splash needs to run in its own process and is easiest to run from a docker container.

scrapy-playwright: works much like selenium, but without the crippling decrease in speed that comes with using selenium. Playwright has no issues fitting into the asynchronous scrapy workflow, making sending requests just as quick as using scrapy alone. It is also much easier to install and integrate than selenium. The scrapy-playwright plugin is maintained by the developers of scrapy as well, and after installing via pypi with pip3 install scrapy-playwright, getting set up is as easy as running playwright install in the terminal.

More details and many examples can be found at each of the plugin's github pages: https://github.com/scrapy-plugins/scrapy-playwright and https://github.com/scrapy-plugins/scrapy-splash.
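For scrapy-playwright, getting a rendered page is roughly as follows; the handler and reactor settings follow the plugin's documented setup, while the spider and target site are just an illustration:

```python
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

```python
# spider
import scrapy


class JsQuotesSpider(scrapy.Spider):
    name = "js_quotes"

    def start_requests(self):
        yield scrapy.Request(
            "http://quotes.toscrape.com/js/",  # a JavaScript-rendered demo page
            meta={"playwright": True},         # ask playwright to render it
        )

    def parse(self, response):
        for text in response.css("div.quote span.text::text").getall():
            yield {"quote": text}
```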
P.S. Both projects tend to work better in a Linux environment in my experience. For Windows users I recommend using them with the Windows Subsystem for Linux (WSL).
Yes, Scrapy can scrape dynamic websites, i.e. websites that are rendered through JavaScript.
There are two approaches to scrape these kinds of websites.
You can use splash to render the Javascript code and then parse the rendered HTML. You can find the doc and project here: Scrapy splash, git.
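A minimal scrapy-splash sketch, assuming a Splash instance is already running (for example via docker) and the scrapy-splash middleware is enabled in settings.py; the target site is just a JavaScript-rendered demo page:

```python
import scrapy
from scrapy_splash import SplashRequest


class SplashQuotesSpider(scrapy.Spider):
    name = "splash_quotes"

    def start_requests(self):
        # Render the page in Splash before it reaches the spider.
        yield SplashRequest(
            "http://quotes.toscrape.com/js/",
            callback=self.parse,
            args={"wait": 0.5},  # give the page's JavaScript time to run
        )

    def parse(self, response):
        for text in response.css("div.quote span.text::text").getall():
            yield {"quote": text}
```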
As previously stated, by monitoring the network calls you can find the API call that fetches the data, and mocking that call in your scrapy spider might help you get the desired data.
I handle the ajax request by using Selenium and the Firefox web driver. It is not that fast if you need the crawler as a daemon, but much better than any manual solution.