How to delete expired items from a database with Scrapy

Posted 2024-08-18 03:55:03


I am spidering a video site that expires content frequently. I am considering using Scrapy to do my spidering, but am not sure how to delete expired items.

Strategies to detect if an item is expired are:

  1. Spider the site's "delete.rss".
  2. Every few days, try reloading the contents page and making sure it still works.
  3. Spider every page of the site's content indexes, and remove the video if it's not found.

Please let me know how to remove expired items in Scrapy. I will be storing my Scrapy items in a MySQL DB via Django.

2010-01-18 Update

I have found a solution that is working, but it still may not be optimal. I am maintaining a "found_in_last_scan" flag on every video that I sync. When the spider starts, it sets all the flags to False. When it finishes, it deletes videos that still have the flag set to False. I did this by attaching to the signals.spider_opened and signals.spider_closed signals. Please confirm that this is a valid strategy and that there are no problems with it.
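With a current Scrapy release, that wiring can be sketched roughly as follows; Video and its found_in_last_scan field are illustrative names for the Django model involved, and the regular item pipeline is assumed to set found_in_last_scan=True on every video it encounters during the crawl:

from scrapy import signals

from myapp.models import Video  # hypothetical Django app/model names


class ExpireVideosExtension(object):

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        # Before the crawl starts, mark every stored video as "not seen yet".
        Video.objects.update(found_in_last_scan=False)

    def spider_closed(self, spider):
        # Anything still unflagged after the crawl was not found again: treat it as expired.
        Video.objects.filter(found_in_last_scan=False).delete()

The class also has to be listed under EXTENSIONS in settings.py for Scrapy to load it.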

Comments (2)

开始看清了 2024-08-25 03:55:03


I haven't tested this!
I have to confess that I haven't tried using the Django models in Scrapy, but here goes:

The simplest way I imagine would be to create a new spider for the deleted.rss file by extending the XMLFeedSpider (copied from the Scrapy documentation, then modified). I suggest you do create a new spider, because very little of the following logic is related to the logic used for scraping the site:

from scrapy.contrib.spiders import XMLFeedSpider
from myproject.items import DeletedUrlItem

class MySpider(XMLFeedSpider):
    domain_name = 'example.com'
    start_urls = ['http://www.example.com/deleted.rss']
    iterator = 'iternodes' # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        item = DeletedUrlItem()
        # '#path/to/url' is a placeholder; use the real XPath to the URL inside each <item>
        item['url'] = node.select('#path/to/url').extract()
        return item # return an Item

SPIDER = MySpider()

This is not a working spider for you to use, but IIRC the RSS files are pure XML. I'm not sure what the deleted.rss looks like, but I'm sure you can figure out how to extract the URLs from the XML. Now, this example imports myproject.items.DeletedUrlItem, which in this example is just a string, but you need to create the DeletedUrlItem using something like the code below:

You need to create the DeletedUrlItem:

from scrapy.item import Item, Field

class DeletedUrlItem(Item):
    url = Field()
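As for actually pulling the URL out of the feed: if deleted.rss follows ordinary RSS 2.0 conventions, each <item> element carries a <link> child with the item's URL. Under that assumption (the real feed wasn't shown here), the placeholder XPath in parse_node above would become something like this, taking the first extracted value so that item['url'] is a single string, which is what the pipeline's objects.get(url=...) lookup below expects:

    def parse_node(self, response, node):
        item = DeletedUrlItem()
        # assumes each <item> in deleted.rss has a <link>http://...</link> child
        item['url'] = node.select('link/text()').extract()[0]
        return item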

Instead of saving, you delete the items using Django's Model API in a Scrapy ItemPipeline - I assume you're using a DjangoItem:

# we raise a DropItem exception so Scrapy
# doesn't try to process the item any further
from scrapy.core.exceptions import DropItem

# import your Django model (app and model names are yours to adjust)
from myapp.models import YourModel

class DeleteUrlPipeline(object):

    def process_item(self, spider, item):
        if item['url']:
            delete_item = YourModel.objects.get(url=item['url'])
            delete_item.delete() # actually delete the item!
            raise DropItem("Deleted: %s" % item)
        return item

Notice the delete_item.delete().
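One more practical detail: Scrapy only runs the pipeline once it is enabled in the project settings. The exact syntax depends on the Scrapy version (older releases took a plain list of class paths in ITEM_PIPELINES; current releases take a dict with an ordering number), for example:

# settings.py -- the module path is an assumed example for a project named "myproject"
ITEM_PIPELINES = {
    'myproject.pipelines.DeleteUrlPipeline': 300,  # lower numbers run earlier in the chain
}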


I'm aware that this answer may contain errors; it's written from memory :-), but I will definitely update if you've got comments or cannot figure this out.

情栀口红 2024-08-25 03:55:03


If you have an HTTP URL which you suspect might not be valid any more (because you found it in a "deleted" feed, or just because you haven't checked it in a while), the simplest, fastest way to check is to send an HTTP HEAD request for that URL. In Python, that's best done with the httplib module of the standard library: make a connection object c to the host of interest with HTTPConnection (with HTTP 1.1, it can be reused to check multiple URLs with better performance and lower system load), then do one (or more, if feasible, i.e. if HTTP 1.1 is in use) calls of c's request method, with 'HEAD' as the first argument and the URL you're checking as the second (without the host part, of course;-).

After each request you call c.getresponse() to get an HTTPResponse object, whose status attribute will tell you if the URL is still valid.
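As a rough illustration of that recipe (using http.client, which is what httplib became in Python 3; the host and paths below are placeholders):

from http.client import HTTPConnection

conn = HTTPConnection('www.example.com')   # HTTP/1.1, so the connection can be reused
for path in ['/videos/1', '/videos/2']:
    conn.request('HEAD', path)             # 'HEAD' first, then the URL without the host part
    response = conn.getresponse()
    response.read()                        # drain the (empty) body so the connection can be reused
    if response.status == 404:
        print(path, 'appears to be gone')
    else:
        print(path, 'returned status', response.status)
conn.close()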

Yes, it's a bit low-level, but exactly for this reason it lets you optimize your task a lot better, with just a little knowledge of HTTP;-).
