Should I create a pipeline to save files with scrapy?

Posted on 2024-11-30 05:40:19

I need to save a file (.pdf) but I'm unsure how to do it. I need to save the .pdfs and store them in such a way that they are organized in directories much like they are stored on the site I'm scraping them from.

From what I can gather I need to make a pipeline, but from what I understand pipelines save "Items", and "Items" are just basic data like strings/numbers. Is saving files a proper use of pipelines, or should I save the file in the spider instead?

Comments (3)

豆芽 2024-12-07 05:40:19

Yes and no[1]. If you fetch a pdf it will be stored in memory, but as long as the pdfs are not big enough to fill up your available memory, that is ok.

You could save the pdf in the spider callback:

# at the top of the spider module
from scrapy import Request

def parse_listing(self, response):
    # ... extract pdf urls
    for url in pdf_urls:
        yield Request(url, callback=self.save_pdf)

def save_pdf(self, response):
    # get_path builds the local file path from the url (see the sketch below)
    path = self.get_path(response.url)
    with open(path, "wb") as f:
        f.write(response.body)
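
The get_path helper above is left to you. A minimal sketch, assuming you want to mirror the url path of the site under a local base directory (the "downloads" name and the helper itself are examples, not part of Scrapy):

# goes at the top of the spider module
import os
from urllib.parse import urlparse

def get_path(self, url):
    # e.g. https://example.com/docs/a/b.pdf -> downloads/docs/a/b.pdf
    relative = urlparse(url).path.lstrip("/")
    path = os.path.join("downloads", relative)
    # make sure the target directory exists before writing
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return path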

If you choose to do it in a pipeline:

# in the spider (MyItem is your item class with body/url/path fields)
def parse_pdf(self, response):
    i = MyItem()
    i['body'] = response.body
    i['url'] = response.url
    # you can add more metadata to the item
    return i

# in your item pipeline class
def process_item(self, item, spider):
    path = self.get_path(item['url'])
    with open(path, "wb") as f:
        f.write(item['body'])
    # remove the body and keep the path as a reference
    del item['body']
    item['path'] = path
    # let the item be processed by other pipelines, e.g. a db store
    return item
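
Note that Scrapy only calls process_item on pipelines that are enabled; a minimal sketch of that registration, with an example module/class name:

# settings.py
ITEM_PIPELINES = {
    "myproject.pipelines.PdfSavePipeline": 300,  # example path; lower numbers run earlier
}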

[1] Another approach could be to store only the pdfs' urls and use another process to fetch the documents without buffering them into memory (e.g. wget).
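
If you go that route, a minimal sketch, assuming wget is installed and the urls were already written out to a text file (the file name and target directory are just examples):

# fetch the collected urls outside of Scrapy with wget
import subprocess

with open("pdf_urls.txt") as f:
    for url in f:
        url = url.strip()
        if url:
            # -P sets the download directory, -nc skips files that already exist
            subprocess.run(["wget", "-nc", "-P", "downloads", url], check=False)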

绳情 2024-12-07 05:40:19

There is a FilesPipeline that you can use directly, assuming you already have the file url. The link shows how to use the FilesPipeline:

https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ
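
As a rough sketch of that approach, assuming a current Scrapy version (the item class name, selector, and settings values below are just examples): enable the built-in pipeline, point FILES_STORE at a directory, and yield items with a file_urls field.

# settings.py
ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
FILES_STORE = "downloads"  # base directory for the downloaded files

# items.py / spider
import scrapy

class PdfItem(scrapy.Item):
    file_urls = scrapy.Field()  # urls the pipeline should download
    files = scrapy.Field()      # filled in by the pipeline with the results

def parse(self, response):
    # collect the pdf links on the page and hand them to the pipeline
    urls = response.css("a[href$='.pdf']::attr(href)").getall()
    yield PdfItem(file_urls=[response.urljoin(u) for u in urls])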

眼眸里的那抹悲凉 2024-12-07 05:40:19

It's a perfect tool for the job. The way Scrapy works is that you have spiders that transform web pages into structured data (items). Pipelines are postprocessors, but they use the same asynchronous infrastructure as the spiders, so they are a good fit for fetching media files.

In your case, you'd first extract the locations of the PDFs in the spider, fetch them in a pipeline, and have another pipeline save the items.
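
If you also want the saved files to keep the directory layout of the source site (the original question), a minimal sketch, assuming a recent Scrapy version, is to subclass the built-in FilesPipeline and override file_path (the class name here is just an example):

from urllib.parse import urlparse
from scrapy.pipelines.files import FilesPipeline

class MirrorPathFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # e.g. https://example.com/docs/a/b.pdf -> docs/a/b.pdf under FILES_STORE
        return urlparse(request.url).path.lstrip("/")

Enable it in ITEM_PIPELINES in place of the stock FilesPipeline.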
