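Creating a large sitemap on Google App Engine?
I have a site with around 100,000 unique pages.
(1) How do I create a sitemap for all of these links? Should I list them flat in one large file that conforms to the sitemap protocol?
(2) I need to implement this on Google App Engine, where there is a 1000-item query limit, and all of my individual site URLs are stored as separate entities. How do I get around this?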
Comments (3)
Sitemaps must be no larger than 10MB (the protocol has since raised this to 50MB uncompressed) and list no more than 50,000 URLs, so with 100,000 pages you're going to need to break it up somehow.
You're going to need some kind of sharding strategy. I don't know what your data looks like, so for now let's say every time you create a page entity, you assign it a random integer between 1 and 500.
Next, create a Sitemap index (google.com/support/webmasters/bin/answer.py?hl=zh-CN&answer=71453) and emit one sitemap link for each of your index values.
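The snippet that originally followed this step is missing, so here is a rough sketch of what the index could look like. It is plain Python rather than App Engine handler code, and the `/sitemap-<n>.xml` URL scheme and the `build_sitemap_index` name are my own assumptions, not anything from the original answer.

```python
from xml.sax.saxutils import escape

NUM_SHARDS = 500  # matches the 1-500 random index described above


def build_sitemap_index(base_url, num_shards=NUM_SHARDS):
    """Return sitemap index XML that links to one sitemap per shard value."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for shard in range(1, num_shards + 1):
        # Hypothetical URL scheme: <base_url>/sitemap-<shard>.xml
        loc = "%s/sitemap-%d.xml" % (base_url.rstrip("/"), shard)
        lines.append("  <sitemap><loc>%s</loc></sitemap>" % escape(loc))
    lines.append("</sitemapindex>")
    return "\n".join(lines)
```

You would serve the returned string from whatever URL you submit to the search engine, with each `<loc>` pointing at the handler for that shard.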
Finally, on each sitemap page, query for pages and filter on that sitemap's random index value. If you have 100,000 pages, this will give you about 200 URLs per sitemap.
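To make the shard idea concrete, here is a minimal sketch using plain dicts in place of datastore entities. The `new_page` and `build_sitemap` names and the `shard` field are hypothetical; on App Engine the `if` test below would instead be an equality filter on the stored index property.

```python
import random
from xml.sax.saxutils import escape

NUM_SHARDS = 500


def new_page(url, rng=random):
    """Create a page record with a random shard index in 1..NUM_SHARDS."""
    return {"url": url, "shard": rng.randint(1, NUM_SHARDS)}


def build_sitemap(pages, shard):
    """Return urlset XML for the pages whose shard index equals `shard`."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for page in pages:
        if page["shard"] == shard:  # stands in for a datastore equality filter
            lines.append("  <url><loc>%s</loc></url>" % escape(page["url"]))
    lines.append("</urlset>")
    return "\n".join(lines)
```

Because the index is random, the shards are only roughly even, so some sitemaps will hold a few more than 200 URLs and some fewer. Either way each one stays far below the 1000-item query limit.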
A slightly different strategy here would be to give each page an auto-incrementing numeric ID. To do so, you need a counter object that is transactionally locked and incremented each time a new page is created. The downside of this is that you can't parallelize creation of new page entities. The upside is that you would have a bit more control over how your pages are laid out, as your first sitemap could be pages 1-1000, and so on.
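With sequential IDs, the mapping between a page and its sitemap is just integer arithmetic. A small sketch, assuming 1-based IDs and 1000 pages per sitemap as in the example above (the function names are made up, and the transactional counter itself is not shown):

```python
PAGES_PER_SITEMAP = 1000


def sitemap_number(page_id, per_sitemap=PAGES_PER_SITEMAP):
    """Return the 1-based sitemap number that contains a 1-based page ID."""
    return (page_id - 1) // per_sitemap + 1


def id_range(sitemap_no, per_sitemap=PAGES_PER_SITEMAP):
    """Return the inclusive (first_id, last_id) covered by a sitemap number."""
    first = (sitemap_no - 1) * per_sitemap + 1
    return first, first + per_sitemap - 1
```

A sitemap handler can then query for IDs inside `id_range(n)` instead of filtering on a random index.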
You can use Query Cursors to get around the 1000-item query limit. Even with cursors, though, you probably won't entirely solve your problem, because generating a sitemap with 100,000 items could easily exceed the time that a single request is allowed to run. Generating the sitemap dynamically could also use up all or a large part of your resource quota.
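For what it's worth, the cursor loop itself is short. The sketch below is not App Engine code: `fetch_page` is a hypothetical callable returning `(items, next_cursor, more)`, which is the same shape ndb's `Query.fetch_page` returns, and `make_fake_fetch` is an in-memory stand-in so the loop can be run anywhere.

```python
def iter_all(fetch_page, page_size=1000):
    """Yield every item by following cursors until no more pages remain.

    `fetch_page(page_size, cursor)` is a hypothetical callable that returns
    (items, next_cursor, more).
    """
    cursor = None
    more = True
    while more:
        items, cursor, more = fetch_page(page_size, cursor)
        for item in items:
            yield item


def make_fake_fetch(data):
    """In-memory stand-in for a datastore query; the cursor is a list offset."""
    def fetch_page(page_size, cursor):
        start = cursor or 0
        end = start + page_size
        return data[start:end], end, end < len(data)
    return fetch_page
```

This only removes the per-query cap. The total work still happens inside one request, so the request deadline and quota concerns above still apply.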
If your data is not very dynamic, I would consider generating a static sitemap file and including it as part of your deployment package. Even if your data is very dynamic, you probably want to adopt a strategy of regenerating it only once per day and doing a deployment to put it up on the server.
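If you go the static route, an offline script along these lines could produce the files before each deployment. This is only a sketch: the `sitemap-<n>.xml` naming and the `write_static_sitemaps` function are assumptions of mine, and you would still need a sitemap index that points at the generated files.

```python
import os
from xml.sax.saxutils import escape

MAX_URLS = 50000  # per-file URL cap from the sitemap protocol


def write_static_sitemaps(urls, out_dir, max_urls=MAX_URLS):
    """Write sitemap-1.xml, sitemap-2.xml, ... and return their file names."""
    os.makedirs(out_dir, exist_ok=True)
    names = []
    for start in range(0, len(urls), max_urls):
        name = "sitemap-%d.xml" % (len(names) + 1)  # hypothetical naming
        body = "\n".join(
            "  <url><loc>%s</loc></url>" % escape(u)
            for u in urls[start:start + max_urls]
        )
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            f.write(body + "\n</urlset>\n")
        names.append(name)
    return names
```

With 100,000 URLs and the default cap this writes two files of 50,000 URLs each.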
I had a similar issue, but instead of reinventing the wheel I just plugged in the Google Sitemap Generator: http://sitemap-generators.googlecode.com/svn/trunk/docs/en/sitemap-generator.html . It worked for me because my app is Python-based.