Running a function when the task queue is empty on App Engine

Posted on 2024-10-29 11:06:28

I have a cron job that runs every day, calls an API and fetches some data. For each row of the data I kick off a task on the task queue to process it (which involves looking up data via further APIs). Once all of this has finished, my data doesn't change for the next 24 hours, so I memcache it.

Is there a way of knowing when all the tasks I queued up have finished so that I can cache the data?

Currently I do it in a really messy fashion, by just scheduling two cron jobs like this:

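import urllib
from datetime import date

# Assumed imports for the Python 2 App Engine runtime; simplejson was
# commonly taken from django.utils there.
from django.utils import simplejson
from google.appengine.api import memcache, taskqueue
from google.appengine.ext import webapp


class fetchdata(webapp.RequestHandler):
    def get(self):
        # Drop today's cache entry before refetching.
        todaykey = str(date.today())
        memcache.delete(todaykey)
        topsyurl = 'http://otter.topsy.com/search.json?q=site:open.spotify.com/album&window=d&perpage=20'
        f = urllib.urlopen(topsyurl)
        response = f.read()
        f.close()

        d = simplejson.loads(response)
        albums = d['response']['list']
        # Queue one task per album; each task does the follow-up API lookups.
        for album in albums:
            taskqueue.add(url='/spotifyapi/', params={'url': album['url'], 'score': album['score']})


class flushcache(webapp.RequestHandler):
    def get(self):
        # Runs 5 minutes later, on the guess that every task has finished.
        todaykey = str(date.today())
        memcache.delete(todaykey)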

Then my cron.yaml looks like this:

cron:
- description: gettopsy
  url: /fetchdata/
  schedule: every day 01:00
  timezone: Europe/London

- description: flushcache
  url: /flushcache/
  schedule: every day 01:05
  timezone: Europe/London

Basically, I'm guessing that all my tasks won't take more than 5 minutes to run, so I just flush the cache 5 minutes later, which ensures that when the data is cached it's complete.

Is there a better way of coding this? It feels like my solution isn't the best one.

Thanks
Tom


Comments (2)

戒ㄋ 2024-11-05 11:06:28

There's not currently any way to determine when your tasks have finished executing. Your best option would be to insert marker records in the datastore, and have each task delete its record when it's done. Each task can then check if it's the last task, and perform your cleanup / caching if it is.
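A minimal sketch of the marker-record idea follows. It uses an in-memory set as a stand-in for the datastore so that it runs anywhere; `MarkerStore`, `finish`, `run_task` and `on_all_done` are hypothetical names, not App Engine APIs. On App Engine the markers would be datastore entities, and the delete-then-check step would need to be atomic (for example inside a transaction) so that two tasks finishing at the same moment cannot both, or neither, decide they were last.

```python
import threading


class MarkerStore(object):
    """In-memory stand-in for datastore marker records (an assumption)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = set()

    def add(self, task_id):
        # Called by the cron handler before it enqueues the task.
        with self._lock:
            self._pending.add(task_id)

    def finish(self, task_id):
        """Delete the marker; return True only for the last task to finish."""
        with self._lock:
            if task_id not in self._pending:
                return False  # e.g. a retried task whose marker is already gone
            self._pending.remove(task_id)
            return not self._pending


def run_task(store, task_id, on_all_done):
    # ... the per-row processing would happen here ...
    if store.finish(task_id):
        on_all_done()


store = MarkerStore()
album_ids = ['album-a', 'album-b', 'album-c']  # placeholder ids
for album_id in album_ids:
    store.add(album_id)  # insert every marker before any task can run

cache_runs = []
for album_id in album_ids:
    run_task(store, album_id, lambda: cache_runs.append('cached'))
```

All markers are inserted before the first task runs; otherwise an early task could find the set empty and trigger the caching step too soon.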

2024-11-05 11:06:28

I found this question while dealing with the same issue. I came up with a different solution, which I'm posting here in case it's useful to others.

This isn't a direct replacement for what you are asking, but it's related: my problem was that I wanted to know when a queue was empty, because that meant that a complex background process had finished running. So I could replace checking the queue size with checking a "deadman timer".

A deadman timer is a timer that is constantly reset by some process. When that process finishes, the timer is no longer reset and eventually expires. So I had all the different tasks that formed part of my complex background process reset the timer and, instead of checking when the queue was empty, I had a cron job that checked when the timer expired.
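As a rough illustration of the idea, here is a small in-memory deadman timer. The class name `DeadmanTimer`, its methods and the injectable `clock` are assumptions made for this sketch; it is not the code from the linked page, and a real App Engine version would keep the deadline somewhere shared between requests.

```python
import time


class DeadmanTimer(object):
    """Expires unless reset() is called at least every `timeout` seconds."""

    def __init__(self, timeout, clock=time.time):
        self.timeout = timeout
        self._clock = clock
        self._deadline = None  # stays None until the first reset

    def reset(self):
        # Each background task calls this while the process is still running.
        self._deadline = self._clock() + self.timeout

    def expired(self):
        # The cron job polls this; True means nothing has reset the timer lately.
        return self._deadline is not None and self._clock() >= self._deadline


# A fake clock keeps the example deterministic.
now = [0.0]
timer = DeadmanTimer(timeout=60, clock=lambda: now[0])

timer.reset()                        # task 1 runs at t=0
now[0] = 30.0
timer.reset()                        # task 2 runs at t=30, deadline becomes t=90
now[0] = 80.0
still_running = not timer.expired()  # t=80 is before the deadline
now[0] = 95.0
finished = timer.expired()           # no reset since t=30, so it has expired
```

A timer that has never been reset reports "not expired" here, so the cron job does not fire before the first task has run; that is a design choice of this sketch.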

Of course, for this to be efficient, the timer has to avoid writing to the data store all the time. The code at http://acooke.org/cute/Deadmantim0.html avoids this by relaxing the behaviour slightly: it uses memcache to hold a copy of the timer object and only resets it in the store after a significant amount of time has passed.
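The throttling can be sketched roughly as below. This is not the code from the linked page: `ThrottledDeadmanTimer`, `write_interval` and the two plain dicts (standing in for the datastore and memcache) are assumptions chosen for illustration. The cache only remembers when the store was last written, so most resets cost a cache read and nothing else.

```python
class ThrottledDeadmanTimer(object):
    """Deadman timer that persists a reset at most once per write_interval."""

    def __init__(self, timeout, write_interval, store, cache, clock):
        self.timeout = timeout
        self.write_interval = write_interval
        self._store = store    # slow and durable (datastore stand-in)
        self._cache = cache    # fast, may be evicted (memcache stand-in)
        self._clock = clock
        self.store_writes = 0

    def reset(self):
        now = self._clock()
        last_write = self._cache.get('last_write')
        # A cache miss (None) simply forces a store write, which is safe.
        if last_write is None or now - last_write >= self.write_interval:
            self._store['deadline'] = now + self.timeout
            self._cache['last_write'] = now
            self.store_writes += 1

    def expired(self):
        deadline = self._store.get('deadline')
        return deadline is not None and self._clock() >= deadline


now = [0]
timer = ThrottledDeadmanTimer(timeout=300, write_interval=60,
                              store={}, cache={}, clock=lambda: now[0])

for t in range(0, 130, 10):    # 13 resets, one every 10 s from t=0 to t=120
    now[0] = t
    timer.reset()

writes = timer.store_writes    # only t=0, t=60 and t=120 reach the store
now[0] = 419
running = not timer.expired()  # stored deadline is 120 + 300 = 420
now[0] = 420
done = timer.expired()
```

The price of the relaxation is that the stored deadline can lag the most recent reset by up to `write_interval`, so `timeout` should be comfortably larger than `write_interval`.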

PS: This is more efficient than what you describe because it doesn't need to write to the database as often. It's also more robust because you don't have to keep track of exactly what is happening.
