I implemented a simple sitemap class using Django's default sitemap application. As it was taking a long time to execute, I added manual caching:
from django.contrib.sitemaps import Sitemap

# get_cache/set_cache and CACHE_SITEMAP_SHORT_REVIEWS are project-specific
# caching helpers and a cache-key constant; ShortReview is the model being
# indexed. All of them are defined elsewhere in the project.

class ShortReviewsSitemap(Sitemap):
    changefreq = "hourly"
    priority = 0.7

    def items(self):
        # Try to retrieve the queryset from the cache first
        result = get_cache(CACHE_SITEMAP_SHORT_REVIEWS, "sitemap_short_reviews")
        if result is not None:
            return result
        result = ShortReview.objects.all().order_by("-created_at")
        # Cache the freshly built queryset for subsequent requests
        set_cache(CACHE_SITEMAP_SHORT_REVIEWS, "sitemap_short_reviews", result)
        return result

    def lastmod(self, obj):
        return obj.updated_at
The problem is that Memcached allows objects of at most 1 MB. This one was bigger than 1 MB, so storing it in the cache failed:
SERVER_ERROR object too large for cache
However, Django already has an automated way of deciding when a sitemap file should be divided into smaller ones. According to the documentation:
You should create an index file if one of your sitemaps has more than 50,000 URLs. In this case, Django will automatically paginate the sitemap, and the index will reflect that.
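For reference, a sketch of how the index plus paginated sitemaps are typically wired up (this uses the modern URL API; the section name "shortreviews" is an assumption):

# Hypothetical urls.py wiring: the index view lists one URL per sitemap
# page, and the section view serves the individual (paginated) sitemaps.
from django.contrib.sitemaps import views as sitemap_views
from django.urls import path

sitemaps = {"shortreviews": ShortReviewsSitemap}  # assumed section name

urlpatterns = [
    path("sitemap.xml", sitemap_views.index, {"sitemaps": sitemaps}),
    path(
        "sitemap-<section>.xml",
        sitemap_views.sitemap,
        {"sitemaps": sitemaps},
        name="django.contrib.sitemaps.views.sitemap",
    ),
]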
What do you think would be the best way to enable sitemap caching?
- Hacking into the Django sitemaps framework to restrict a single sitemap's size to, say, 10,000 records seems like the best idea. Why was 50,000 chosen in the first place? Google's advice? A random number?
- Or maybe there is a way to allow Memcached to store bigger files?
- Or perhaps, once saved, the sitemaps should be made available as static files? This would mean that instead of caching with Memcached I'd have to manually store the results in the filesystem and retrieve them from there the next time the sitemap is requested (perhaps cleaning the directory daily in a cron job).
All of these seem very low-level, and I'm wondering whether an obvious solution exists...
4 Answers
50k is not a hard-coded parameter.
You can use the class django.contrib.sitemaps.GenericSitemap instead.
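For instance, a minimal sketch that lowers the per-page limit (Django's Sitemap class paginates on its limit attribute, and 50,000 is only the default; the value 2,000 here is an arbitrary example):

# Subclass GenericSitemap and lower the per-page limit so each generated
# sitemap page stays small enough to cache.
from django.contrib.sitemaps import GenericSitemap


class LimitGenericSitemap(GenericSitemap):
    limit = 2000  # at most 2,000 URLs per sitemap page

Register LimitGenericSitemap in your sitemaps dict in place of GenericSitemap, and each generated page should stay comfortably under Memcached's 1 MB cap.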
You can also serve sitemaps in gzip format, which makes them a lot smaller; XML is perfectly suited to gzip compression. What I sometimes do: create the gzipped sitemap file(s) in a cron job and re-render them as often as necessary. Usually, once a day will suffice. Just make sure to serve sitemap.xml.gz from your domain root.
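A minimal sketch of what such a cron-run script could look like (the sitemap class, the output path, and the use of the sites framework are assumptions; Django's bundled sitemap.xml template does the actual rendering):

# Render the sitemap once and write it gzipped to disk; run this from a
# daily cron job (e.g. via a management command).
import gzip

from django.contrib.sites.models import Site
from django.template import loader

from myproject.sitemaps import ShortReviewsSitemap  # hypothetical import


def write_gzipped_sitemap(path="/var/www/root/sitemap.xml.gz"):
    site = Site.objects.get_current()  # requires the sites framework
    urls = ShortReviewsSitemap().get_urls(site=site)
    xml = loader.render_to_string("sitemap.xml", {"urlset": urls})
    with gzip.open(path, "wb") as f:
        f.write(xml.encode("utf-8"))

With this in place, Memcached drops out of the picture entirely; the web server just serves a small static file.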
This should get you started.
Assuming you don't need all those pages in your sitemap, reducing the limit to get the file size down will work fine, as described in the previous answer.
If you do want a very large sitemap and do want to use Memcached you could split the content up into multiple chunks, store them under individual keys and then put them back together again on output. To make this more efficient, Memcached supports the ability to get multiple keys at the same time, although I'm not sure whether the Django client supports this capability yet.
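A hedged sketch of that idea using Django's cache API (set_many/get_many are part of the cache backends; the key scheme and chunk size are assumptions):

# Split a large payload across several Memcached keys and reassemble it
# with get_many() in a single round trip.
from django.core.cache import cache

CHUNK_SIZE = 900 * 1024  # stay safely below Memcached's 1 MB object limit


def set_chunked(key, data, timeout=3600):
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    cache.set(key, len(chunks), timeout)  # record how many chunks exist
    cache.set_many(
        {"%s:%d" % (key, i): chunk for i, chunk in enumerate(chunks)},
        timeout,
    )


def get_chunked(key):
    count = cache.get(key)
    if count is None:
        return None
    keys = ["%s:%d" % (key, i) for i in range(count)]
    chunks = cache.get_many(keys)  # one round trip for all chunks
    if len(chunks) != count:  # a chunk was evicted: treat as a cache miss
        return None
    return b"".join(chunks[k] for k in keys)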
For reference, the 1 MB limit is a consequence of how Memcached stores data: http://code.google.com/p/memcached/wiki/FAQ#What_is_the_maximum_data_size_you_can_store?_(1_megabyte)
I have about 200,000 pages on my site, so I had to have the index no matter what. I ended up doing the hack, limiting the sitemap to 250 links, and also implementing a file-based cache.
The basic algorithm is this: try to serve the requested sitemap page from disk; if it isn't there, generate it with the sitemap framework, and save it to disk if the page is complete, so it can be served directly next time.
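A hedged sketch of that flow as a view wrapper (the cache directory and filename scheme are assumptions, and unlike the approach described here this simplified version persists every page, not only complete ones):

# Wrap Django's sitemap view: serve a previously rendered page from disk,
# otherwise render it, write it to disk, and return it.
import os

from django.contrib.sitemaps import views as sitemap_views
from django.http import HttpResponse

SITEMAP_CACHE_DIR = "/var/cache/sitemaps"  # hypothetical location


def cached_sitemap(request, sitemaps, section=None):
    page = request.GET.get("p", "1")
    filename = "sitemap-%s-p%s.xml" % (section or "all", page)
    path = os.path.join(SITEMAP_CACHE_DIR, filename)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return HttpResponse(f.read(), content_type="application/xml")
    response = sitemap_views.sitemap(request, sitemaps, section=section)
    response.render()  # TemplateResponse in current Django; render before reading .content
    with open(path, "wb") as f:
        f.write(response.content)
    return response

Invalidation is then just deleting the files from the cache directory.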
The end result is that the first time a sitemap is requested, if it's complete, it's generated and saved to disk. The next time it's requested, it's simply served from disk. Since my content never changes, this works very well. However, if I do want to change a sitemap, it's as simple as deleting the file(s) from disk, and waiting for the crawlers to come regenerate things.
The code for the whole thing is here, if you're interested: http://bitbucket.org/mlissner/legal-current-awareness/src/tip/alert/alertSystem/sitemap.py
Maybe this will be a good solution for you too.