Restrict resource types to XHR only in scrapy-playwright

Posted 2025-01-10 06:43:41

I want to return only the XHR responses from scrapy_playwright using playwright_page_event_handlers. After checking the jsonlines file, I find that the output has not successfully been restricted to only the XHRs.

I know I can filter before writing the file, but I would rather save the time it takes to fetch these resources in the first place than filter everything afterwards.

How can I restrict the resource types to XHR only?

Here's what I have tried:

from playwright.async_api import Response as PlaywrightResponse, BrowserContext
from scrapy_playwright.page import PageCoroutine
from scrapy import Spider, Request
import jsonlines


class EventSpider(Spider):
    name = "event"

    def start_requests(self):
        yield Request(
            url="http://quotes.toscrape.com/scroll",
            cookies={"foo": "bar", "asdf": "qwerty"},
            meta=dict(
                playwright=True,
                playwright_page_coroutines=[
                    PageCoroutine("wait_for_selector", "div.quote"),
                    PageCoroutine("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageCoroutine("wait_for_selector", "div.quote:nth-child(11)"),  # 10 per page
                ],
                playwright_page_event_handlers={
                    "response": "handle_response",
                    "context": self.configure_context,
                },
            ),
        )

    async def configure_context(self, name: str, context: BrowserContext) -> None:
        def handle_route(route):
            # post_data can be None, so guard before the substring check
            if "quotes" in (route.request.post_data or ""):
                route.fulfill()
            else:
                route.continue_()
        await context.route("/api/**", handle_route)

    async def handle_response(self, response: PlaywrightResponse) -> None:
        jl_file = "test.jl"
        data = {response.request.resource_type: [response.request.url]}
        with jsonlines.open(jl_file, mode="a") as writer:
            writer.write(data)

    def parse(self, response):
        return {"url": response.url}

Produces the following output:

{"document": ["http://quotes.toscrape.com/scroll"]}
{"stylesheet": ["http://quotes.toscrape.com/static/bootstrap.min.css"]}
{"stylesheet": ["http://quotes.toscrape.com/static/main.css"]}
{"script": ["http://quotes.toscrape.com/static/jquery.js"]}
{"stylesheet": ["https://fonts.googleapis.com/css?family=Raleway:400,700"]}
{"font": ["https://fonts.gstatic.com/s/raleway/v26/1Ptug8zYS_SKggPNyC0IT4ttDfA.woff2"]}
{"xhr": ["http://quotes.toscrape.com/api/quotes?page=1"]}
{"xhr": ["http://quotes.toscrape.com/api/quotes?page=2"]}
{"xhr": ["http://quotes.toscrape.com/api/quotes?page=3"]}

Expected output:

{"xhr": ["http://quotes.toscrape.com/api/quotes?page=1"]}
{"xhr": ["http://quotes.toscrape.com/api/quotes?page=2"]}
{"xhr": ["http://quotes.toscrape.com/api/quotes?page=3"]}
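For reference, the in-handler filtering mentioned above (discarding non-XHR responses after they have already been downloaded) boils down to something like this simplified sketch. `record_if_xhr` and the list sink are illustrative stand-ins, not part of scrapy-playwright, and the network cost of fetching the resource has already been paid by the time this check runs:

```python
import json


def record_if_xhr(resource_type: str, url: str, sink: list) -> bool:
    """Append a JSON line for XHR requests only; skip everything else.

    Mirrors the handle_response logic above, except the filtering
    happens after the resource was fetched, which is exactly the
    cost the question wants to avoid.
    """
    if resource_type != "xhr":
        return False
    sink.append(json.dumps({resource_type: [url]}))
    return True


lines: list = []
record_if_xhr("stylesheet", "http://quotes.toscrape.com/static/main.css", lines)
record_if_xhr("xhr", "http://quotes.toscrape.com/api/quotes?page=1", lines)
print(lines)  # only the xhr entry survives
```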



1 Comment

许仙没带伞 2025-01-17 06:43:41

Use the PLAYWRIGHT_ABORT_REQUEST setting.

In your case:

def should_abort_request(request):
    # Abort everything that is not an XHR. The main document and the
    # scripts that fire the XHR calls still have to load, so they are
    # let through as well.
    return request.resource_type not in ("xhr", "document", "script")

and apply the function in the settings:

PLAYWRIGHT_ABORT_REQUEST = should_abort_request
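For completeness, here is a minimal sketch of how this could sit in a project's settings.py. The DOWNLOAD_HANDLERS and TWISTED_REACTOR entries are the standard scrapy-playwright wiring from its README; only PLAYWRIGHT_ABORT_REQUEST is specific to this answer:

```python
# settings.py (sketch)

def should_abort_request(request):
    # Drop everything that is not an XHR, but keep the document and
    # scripts so the page can still load and trigger the XHR calls.
    return request.resource_type not in ("xhr", "document", "script")


PLAYWRIGHT_ABORT_REQUEST = should_abort_request

# Standard scrapy-playwright setup:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```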
