How to delete expired items from a database with Scrapy
I am spidering a video site that expires content frequently. I am considering using Scrapy to do my spidering, but am not sure how to delete expired items.
Strategies to detect if an item is expired are:
- Spider the site's "delete.rss".
- Every few days, try reloading the contents page and making sure it still works.
- Spider every page of the site's content indexes, and remove the video if it's not found.
Please let me know how to remove expired items in Scrapy. I will be storing my Scrapy items in a MySQL DB via Django.
2010-01-18 Update
I have found a solution that is working, but it still may not be optimal. I am maintaining a "found_in_last_scan" flag on every video that I sync. When the spider starts, it sets all the flags to False. When it finishes, it deletes the videos whose flag is still False. I did this by attaching to the signals.spider_opened and signals.spider_closed signals. Please confirm this is a valid strategy and that there are no problems with it.
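A minimal sketch of what such a flag-and-signals approach might look like (this is an illustration, not the actual code: the Video model, its found_in_last_scan field, and the Django setup are assumptions, and the signal hookup shown uses the from_crawler registration style):

```python
from scrapy import signals
from myapp.models import Video   # assumed Django model with a found_in_last_scan BooleanField

class ExpireUnseenVideos(object):
    """Scrapy extension sketch: clear every flag when the spider opens,
    then delete whatever was never re-flagged when it closes.

    Enable it via the EXTENSIONS setting; the regular item pipeline is
    assumed to set found_in_last_scan=True on each video it sees."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        # Assume everything is gone until the crawl proves otherwise.
        Video.objects.all().update(found_in_last_scan=False)

    def spider_closed(self, spider):
        # Whatever the crawl never re-flagged is treated as expired.
        Video.objects.filter(found_in_last_scan=False).delete()
```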
2 Answers
I haven't tested this!
I have to confess that I haven't tried using the Django models in Scrapy, but here goes:
The simplest way I can imagine would be to create a new spider for the deleted.rss file by extending the XMLFeedSpider (copied from the Scrapy documentation, then modified). I suggest you create a new spider because very little of the following logic is related to the logic used for scraping the site. This is not a working spider for you to use, but IIRC the RSS files are pure XML; I'm not sure what deleted.rss looks like, but I'm sure you can figure out how to extract the URLs from the XML.
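A rough sketch of what such a spider might look like; the allowed_domains, start_urls, and the link/text() XPath below are placeholders for whatever the real feed uses, and the import path for XMLFeedSpider differs between old and new Scrapy releases:

```python
from scrapy.spiders import XMLFeedSpider   # scrapy.contrib.spiders in older Scrapy releases
from myproject.items import DeletedUrlItem

class DeletedVideosSpider(XMLFeedSpider):
    name = 'deleted-videos'
    allowed_domains = ['example.com']                    # placeholder domain
    start_urls = ['http://www.example.com/deleted.rss']  # placeholder feed URL
    iterator = 'iternodes'
    itertag = 'item'   # treat every RSS <item> as one deleted video

    def parse_node(self, response, node):
        # Extract the video URL from the RSS item; adjust the XPath to the real feed layout.
        item = DeletedUrlItem()
        item['url'] = node.xpath('link/text()').get()
        return item
```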
The spider above imports myproject.items.DeletedUrlItem, which is just a one-field item in this example, but you need to create the DeletedUrlItem yourself using something like the code below.
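A minimal sketch of that item, assuming the single url field is all the pipeline below needs:

```python
from scrapy.item import Item, Field

class DeletedUrlItem(Item):
    url = Field()   # URL of a video the site reports as deleted
```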
Instead of saving, you delete the items using Django's Model API in a Scrapy ItemPipeline - I assume you're using a DjangoItem for the regular video items. A sketch of such a pipeline is below.
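This is only a sketch: the Video model name is an assumption for whichever Django model the videos are stored in, and the pipeline still has to be registered in ITEM_PIPELINES with Django's settings configured so the ORM is usable from the crawl process.

```python
from scrapy.exceptions import DropItem
from myproject.items import DeletedUrlItem
from myapp.models import Video   # assumed name of the Django model holding the videos

class DeleteUrlPipeline(object):

    def process_item(self, item, spider):
        if isinstance(item, DeletedUrlItem):
            # Look the stored video up by URL and remove it via Django's Model API.
            delete_item = Video.objects.get(url=item['url'])
            delete_item.delete()
            raise DropItem("Deleted expired video: %s" % item['url'])
        # Anything else passes through to the next pipeline untouched.
        return item
```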
Notice the delete_item.delete() call. I'm aware that this answer may contain errors, it's written from memory :-), but I will definitely update it if you've got comments or cannot figure this out.
If you have an HTTP URL which you suspect might not be valid any more (because you found it in a "deleted" feed, or just because you haven't checked it in a while), the simplest, fastest way to check is to send an HTTP HEAD request for that URL. In Python, that's best done with the httplib module of the standard library: make a connection object c to the host of interest with HTTPConnection (if HTTP 1.1 is in use, it may be reusable to check multiple URLs with better performance and lower system load), then do one (or more, if feasible, i.e. if HTTP 1.1 is in use) calls of c's request method, with first argument 'HEAD' and second argument the URL you're checking (without the host part, of course;-). After each request you call c.getresponse() to get an HTTPResponse object, whose status attribute will tell you whether the URL is still valid. Yes, it's a bit low-level, but exactly for this reason it lets you optimize your task a lot better, with just a little knowledge of HTTP;-).
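A minimal sketch of that HEAD check; httplib was renamed http.client in Python 3, and the calls (HTTPConnection, request, getresponse) are the same under either name. The status threshold is an assumption about what counts as "still valid":

```python
from http.client import HTTPConnection   # named httplib in Python 2
from urllib.parse import urlparse

def is_still_valid(url):
    """Send a HEAD request and report whether the URL still answers with content."""
    parts = urlparse(url)
    conn = HTTPConnection(parts.netloc)   # one connection can be reused for several
    try:                                  # URLs on the same host under HTTP 1.1
        conn.request('HEAD', parts.path or '/')
        response = conn.getresponse()
        return response.status < 400      # e.g. 404/410 means the video is gone
    finally:
        conn.close()

# Hypothetical usage: drop database rows whose URL no longer answers.
# for video in Video.objects.all():
#     if not is_still_valid(video.url):
#         video.delete()
```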