Should I create a pipeline to save files with scrapy?
I need to save files (.pdf) but I'm unsure how to do it. I need to save the .pdfs and store them in such a way that they are organized in directories much like they are stored on the site I'm scraping them from.
From what I can gather I need to make a pipeline, but from what I understand pipelines save "items", and items are just basic data like strings/numbers. Is saving files a proper use of pipelines, or should I save the files in the spider instead?
3 Answers
Yes and no[1]. If you fetch a pdf it will be stored in memory, but as long as the pdfs are not big enough to fill up your available memory, that is okay.
You could save the pdf in the spider callback:
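The answer's original code did not survive the page conversion; here is a minimal sketch of what the callback could look like. The spider name, start url, selector, and path logic are all illustrative assumptions, not the original poster's code:

from pathlib import Path
import os
import scrapy


class PdfSpider(scrapy.Spider):
    name = "pdf_spider"                         # hypothetical spider name
    start_urls = ["https://example.com/docs/"]  # hypothetical start page

    def parse(self, response):
        # Queue a request for every linked .pdf and save it in the callback
        for href in response.css('a[href$=".pdf"]::attr(href)').getall():
            yield response.follow(href, callback=self.save_pdf)

    def save_pdf(self, response):
        # Reuse the url's path so the on-disk layout mirrors the site's
        path = response.url.split("://", 1)[-1].split("/", 1)[-1]
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as f:
            f.write(response.body)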
If you choose to do it in a pipeline:
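Again, the original snippet is missing; under the same assumptions, the spider would yield the pdf bytes in an item and a pipeline would write them out:

import os
import scrapy


class PdfItem(scrapy.Item):
    # Hypothetical item: the spider puts the url and raw bytes in here
    url = scrapy.Field()
    body = scrapy.Field()


# In the spider callback: yield the data instead of touching the disk
def save_pdf(self, response):
    return PdfItem(url=response.url, body=response.body)


# In pipelines.py: the pipeline does the actual writing
class SavePdfPipeline:
    def process_item(self, item, spider):
        # Same path logic as above; illustrative only
        path = item["url"].split("://", 1)[-1].split("/", 1)[-1]
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as f:
            f.write(item["body"])
        return item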
[1] Another approach could be to store only the pdfs' urls and use another process to fetch the documents without buffering them into memory (e.g. wget).
There is a FilesPipeline that you can use directly, assuming you already have the file url. The link shows how to use the FilesPipeline:
https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ
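For reference, a minimal sketch of the standard FilesPipeline wiring (the file_urls/files field names, the FILES_STORE setting, and the scrapy.pipelines.files.FilesPipeline class are Scrapy's built-ins; the selector and directory name are assumptions):

# settings.py -- enable the built-in pipeline and pick a storage directory
ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
FILES_STORE = "downloaded_pdfs"  # illustrative path

# items.py -- FilesPipeline expects these two fields
import scrapy

class PdfItem(scrapy.Item):
    file_urls = scrapy.Field()  # urls the pipeline should download
    files = scrapy.Field()      # populated with results after download

# in the spider callback: just hand over the urls
def parse(self, response):
    urls = response.css('a[href$=".pdf"]::attr(href)').getall()
    yield PdfItem(file_urls=[response.urljoin(u) for u in urls])

Note that FilesPipeline stores files under a hash of the url by default, so mirroring the site's directory layout, as the question asks, would still require overriding its file_path() method.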
It's a perfect tool for the job. The way Scrapy works is that you have spiders that transform web pages into structured data (items). Pipelines are postprocessors, but they use the same asynchronous infrastructure as spiders, so it's perfect for fetching media files.
In your case, you'd first extract the locations of the PDFs in the spider, fetch them in a pipeline, and have another pipeline save the items.
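A sketch of how that chain might be wired; the two pipeline classes are hypothetical names, but the ordering semantics of ITEM_PIPELINES (lower numbers run first) are Scrapy's:

# settings.py -- fetching runs before saving because 100 < 200
ITEM_PIPELINES = {
    "myproject.pipelines.FetchPdfPipeline": 100,  # hypothetical: downloads each pdf
    "myproject.pipelines.SaveItemPipeline": 200,  # hypothetical: persists item metadata
}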