How to delete expired items from a database with Scrapy
I am spidering a video site that expires content frequently. I am considering using Scrapy to do my spidering, but am not sure how to delete expired items.
Strategies to detect if an item is expired are:
- Spider the site's "delete.rss".
- Every few days, try reloading the contents page and making sure it still works.
- Spider every page of the site's content indexes, and remove the video if it's not found.
Please let me know how to remove expired items in Scrapy. I will be storing my Scrapy items in a MySQL DB via Django.
2010-01-18 Update
I have found a solution that is working, but it still may not be optimal. I am maintaining a "found_in_last_scan" flag on every video that I sync. When the spider starts, it sets all the flags to False. When it finishes, it deletes the videos whose flag is still False. I did this by attaching to the signals.spider_opened and signals.spider_closed signals. Please confirm this is a valid strategy and that there are no problems with it.
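A minimal sketch of what such a flag-and-signals approach might look like (this is an illustration, not the actual code: the Video model, its found_in_last_scan field, and the Django setup are assumptions, and the signal hookup shown uses the from_crawler registration style):

```python
from scrapy import signals
from myapp.models import Video   # assumed Django model with a found_in_last_scan BooleanField

class ExpireUnseenVideos(object):
    """Scrapy extension sketch: clear every flag when the spider opens,
    then delete whatever was never re-flagged when it closes.

    Enable it via the EXTENSIONS setting; the regular item pipeline is
    assumed to set found_in_last_scan=True on each video it sees."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        # Assume everything is gone until the crawl proves otherwise.
        Video.objects.all().update(found_in_last_scan=False)

    def spider_closed(self, spider):
        # Whatever the crawl never re-flagged is treated as expired.
        Video.objects.filter(found_in_last_scan=False).delete()
```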
2 Answers
I haven't tested this!
I have to confess that I haven't tried using the Django models in Scrapy, but here goes:
The simplest way I can imagine would be to create a new spider for the deleted.rss file by extending the XMLFeedSpider (copied from the Scrapy documentation, then modified). I suggest you create a new spider because very little of the following logic is related to the logic used for scraping the site. This is not a working spider for you to use, but IIRC the RSS files are pure XML; I'm not sure what deleted.rss looks like, but I'm sure you can figure out how to extract the URLs from the XML.
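A rough sketch of what such a spider might look like; the allowed_domains, start_urls, and the link/text() XPath below are placeholders for whatever the real feed uses, and the import path for XMLFeedSpider differs between old and new Scrapy releases:

```python
from scrapy.spiders import XMLFeedSpider   # scrapy.contrib.spiders in older Scrapy releases
from myproject.items import DeletedUrlItem

class DeletedVideosSpider(XMLFeedSpider):
    name = 'deleted-videos'
    allowed_domains = ['example.com']                    # placeholder domain
    start_urls = ['http://www.example.com/deleted.rss']  # placeholder feed URL
    iterator = 'iternodes'
    itertag = 'item'   # treat every RSS <item> as one deleted video

    def parse_node(self, response, node):
        # Extract the video URL from the RSS item; adjust the XPath to the real feed layout.
        item = DeletedUrlItem()
        item['url'] = node.xpath('link/text()').get()
        return item
```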
The spider above imports myproject.items.DeletedUrlItem, which is just a one-field item in this example, but you need to create the DeletedUrlItem yourself using something like the code below.
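A minimal sketch of that item, assuming the single url field is all the pipeline below needs:

```python
from scrapy.item import Item, Field

class DeletedUrlItem(Item):
    url = Field()   # URL of a video the site reports as deleted
```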
Instead of saving, you delete the items using Django's Model API in a Scrapy ItemPipeline - I assume you're using a DjangoItem for the regular video items. A sketch of such a pipeline is below.
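This is only a sketch: the Video model name is an assumption for whichever Django model the videos are stored in, and the pipeline still has to be registered in ITEM_PIPELINES with Django's settings configured so the ORM is usable from the crawl process.

```python
from scrapy.exceptions import DropItem
from myproject.items import DeletedUrlItem
from myapp.models import Video   # assumed name of the Django model holding the videos

class DeleteUrlPipeline(object):

    def process_item(self, item, spider):
        if isinstance(item, DeletedUrlItem):
            # Look the stored video up by URL and remove it via Django's Model API.
            delete_item = Video.objects.get(url=item['url'])
            delete_item.delete()
            raise DropItem("Deleted expired video: %s" % item['url'])
        # Anything else passes through to the next pipeline untouched.
        return item
```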
Notice the delete_item.delete() call. I'm aware that this answer may contain errors, it's written from memory :-), but I will definitely update it if you've got comments or cannot figure this out.
If you have an HTTP URL which you suspect might not be valid any more (because you found it in a "deleted" feed, or just because you haven't checked it in a while), the simplest, fastest way to check is to send an HTTP HEAD request for that URL. In Python, that's best done with the httplib module of the standard library: make a connection object c to the host of interest with HTTPConnection (if HTTP 1.1 is in use, it may be reusable to check multiple URLs with better performance and lower system load), then do one (or more, if feasible, i.e. if HTTP 1.1 is in use) calls of c's request method, with first argument 'HEAD' and second argument the URL you're checking (without the host part, of course;-). After each request you call c.getresponse() to get an HTTPResponse object, whose status attribute will tell you whether the URL is still valid. Yes, it's a bit low-level, but exactly for this reason it lets you optimize your task a lot better, with just a little knowledge of HTTP;-).
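A minimal sketch of that HEAD check; httplib was renamed http.client in Python 3, and the calls (HTTPConnection, request, getresponse) are the same under either name. The status threshold is an assumption about what counts as "still valid":

```python
from http.client import HTTPConnection   # named httplib in Python 2
from urllib.parse import urlparse

def is_still_valid(url):
    """Send a HEAD request and report whether the URL still answers with content."""
    parts = urlparse(url)
    conn = HTTPConnection(parts.netloc)   # one connection can be reused for several
    try:                                  # URLs on the same host under HTTP 1.1
        conn.request('HEAD', parts.path or '/')
        response = conn.getresponse()
        return response.status < 400      # e.g. 404/410 means the video is gone
    finally:
        conn.close()

# Hypothetical usage: drop database rows whose URL no longer answers.
# for video in Video.objects.all():
#     if not is_still_valid(video.url):
#         video.delete()
```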