Should I create a pipeline to save files with scrapy?
I need to save files (.pdf) but I'm unsure how to do it. I need to save the .pdfs and store them in such a way that they are organized in directories much like they are stored on the site I'm scraping them from.
From what I can gather I need to make a pipeline, but from what I understand pipelines save "items", and items are just basic data like strings/numbers. Is saving files a proper use of pipelines, or should I save the files in the spider instead?
3 Answers
Yes and no[1]. If you fetch a pdf it will be stored in memory, but as long as the pdfs are not big enough to fill up your available memory, that is okay.
You could save the pdf in the spider callback:
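The answer's original code did not survive the page conversion; here is a minimal sketch of what the callback could look like. The spider name, start url, selector, and path logic are all illustrative assumptions, not the original poster's code:

from pathlib import Path
import os
import scrapy


class PdfSpider(scrapy.Spider):
    name = "pdf_spider"                         # hypothetical spider name
    start_urls = ["https://example.com/docs/"]  # hypothetical start page

    def parse(self, response):
        # Queue a request for every linked .pdf and save it in the callback
        for href in response.css('a[href$=".pdf"]::attr(href)').getall():
            yield response.follow(href, callback=self.save_pdf)

    def save_pdf(self, response):
        # Reuse the url's path so the on-disk layout mirrors the site's
        path = response.url.split("://", 1)[-1].split("/", 1)[-1]
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as f:
            f.write(response.body)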
If you choose to do it in a pipeline:
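Again, the original snippet is missing; under the same assumptions, the spider would yield the pdf bytes in an item and a pipeline would write them out:

import os
import scrapy


class PdfItem(scrapy.Item):
    # Hypothetical item: the spider puts the url and raw bytes in here
    url = scrapy.Field()
    body = scrapy.Field()


# In the spider callback: yield the data instead of touching the disk
def save_pdf(self, response):
    return PdfItem(url=response.url, body=response.body)


# In pipelines.py: the pipeline does the actual writing
class SavePdfPipeline:
    def process_item(self, item, spider):
        # Same path logic as above; illustrative only
        path = item["url"].split("://", 1)[-1].split("/", 1)[-1]
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as f:
            f.write(item["body"])
        return item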
[1] Another approach could be to store only the pdfs' urls and use another process to fetch the documents without buffering them into memory (e.g. wget).
There is a FilesPipeline that you can use directly, assuming you already have the file url. The link shows how to use the FilesPipeline:
https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ
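For reference, a minimal sketch of the standard FilesPipeline wiring (the file_urls/files field names, the FILES_STORE setting, and the scrapy.pipelines.files.FilesPipeline class are Scrapy's built-ins; the selector and directory name are assumptions):

# settings.py -- enable the built-in pipeline and pick a storage directory
ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
FILES_STORE = "downloaded_pdfs"  # illustrative path

# items.py -- FilesPipeline expects these two fields
import scrapy

class PdfItem(scrapy.Item):
    file_urls = scrapy.Field()  # urls the pipeline should download
    files = scrapy.Field()      # populated with results after download

# in the spider callback: just hand over the urls
def parse(self, response):
    urls = response.css('a[href$=".pdf"]::attr(href)').getall()
    yield PdfItem(file_urls=[response.urljoin(u) for u in urls])

Note that FilesPipeline stores files under a hash of the url by default, so mirroring the site's directory layout, as the question asks, would still require overriding its file_path() method.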
It's a perfect tool for the job. The way Scrapy works is that you have spiders that transform web pages into structured data (items). Pipelines are postprocessors, but they use the same asynchronous infrastructure as spiders, so it's perfect for fetching media files.
In your case, you'd first extract the locations of the PDFs in the spider, fetch them in a pipeline, and have another pipeline save the items.
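A sketch of how that chain might be wired; the two pipeline classes are hypothetical names, but the ordering semantics of ITEM_PIPELINES (lower numbers run first) are Scrapy's:

# settings.py -- fetching runs before saving because 100 < 200
ITEM_PIPELINES = {
    "myproject.pipelines.FetchPdfPipeline": 100,  # hypothetical: downloads each pdf
    "myproject.pipelines.SaveItemPipeline": 200,  # hypothetical: persists item metadata
}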