Sitemap structure for a large App Engine site
I'm thinking about the best way to structure a large App Engine site (1M+ URLs).
I need a sitemaps.xml file in the root path of the domain that links to sitemap[n].xml files.
The sitemaps.xml file can link to up to 1000 sitemap[n].xml files, and each of these sitemap[n].xml files can hold up to 50K URLs.
Is there a way to dynamically generate the files with the 50K URLs?
Is there any other way to do it without fetching 50K entities at a time?
Thanks!
PS: The files cannot be static because they have to be placed in the root path of the domain :(
Comments (2)
Your best bet is to generate them ahead of time. Maybe run a map-reduce over your data and store each sitemap[n].xml in a blob in a separate datastore entity. Then the handler (which is mapped from - url: /sitemap(.*) in app.yaml) simply returns the blob from the corresponding entity.
All of this really depends on how your URLs are stored and/or generated.
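A minimal sketch of the pre-generation step, assuming the URLs can be iterated as plain strings. The names chunked, build_sitemap, build_index and pregenerate are hypothetical, and the returned dict only stands in for the per-file datastore entities; a real job would write each value to its own entity instead.

```python
from itertools import islice
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50000  # per-file URL limit mentioned in the question


def chunked(iterable, size):
    """Yield lists of at most `size` items without loading everything at once."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_sitemap(urls):
    """Render one sitemap[n].xml document for a group of URLs."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="%s">' % SITEMAP_NS]
    lines.extend("  <url><loc>%s</loc></url>" % escape(u) for u in urls)
    lines.append("</urlset>")
    return "\n".join(lines)


def build_index(base_url, count):
    """Render sitemaps.xml, linking to sitemap1.xml .. sitemap<count>.xml."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<sitemapindex xmlns="%s">' % SITEMAP_NS]
    for n in range(1, count + 1):
        loc = escape("%s/sitemap%d.xml" % (base_url, n))
        lines.append("  <sitemap><loc>%s</loc></sitemap>" % loc)
    lines.append("</sitemapindex>")
    return "\n".join(lines)


def pregenerate(all_urls, base_url, chunk_size=MAX_URLS):
    """Return {filename: xml}; the dict is a stand-in for datastore entities."""
    files = {}
    for n, chunk in enumerate(chunked(all_urls, chunk_size), start=1):
        files["sitemap%d.xml" % n] = build_sitemap(chunk)
    files["sitemaps.xml"] = build_index(base_url, len(files))
    return files
```

With something like this run offline or from a scheduled job, the request handler only has to look up the stored document by file name and return it, so no request ever fetches 50K entities.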
You could also generate all the URLs offline and put them in one huge file. Upload that file to the blobstore along with a file that has the offsets for each group of 50K URLs in that file. In the handler, simply take the corresponding group of 50K URLs from the blobstore.
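The offsets idea could look roughly like this. It is a sketch under assumptions: one URL per line, UTF-8 encoded, and any binary file-like object that supports seek and read standing in for the uploaded blob. The function names write_url_file and read_group are hypothetical.

```python
import io


def write_url_file(urls, out, group_size=50000):
    """Write one URL per line to the binary file `out`.

    Returns the byte offset where each group starts, followed by the
    final end-of-file offset, so group n spans offsets[n]..offsets[n + 1].
    """
    offsets = []
    pos = 0
    for i, url in enumerate(urls):
        if i % group_size == 0:
            offsets.append(pos)  # a new group starts at this byte
        data = (url + "\n").encode("utf-8")
        out.write(data)
        pos += len(data)
    offsets.append(pos)
    return offsets


def read_group(f, offsets, n):
    """Return the URLs of group n (0-based), reading only that byte range."""
    start, end = offsets[n], offsets[n + 1]
    f.seek(start)
    return f.read(end - start).decode("utf-8").splitlines()


# Usage with an in-memory buffer in place of the uploaded blob.
buf = io.BytesIO()
offsets = write_url_file(["https://example.com/%d" % i for i in range(5)],
                         buf, group_size=2)
last_group = read_group(buf, offsets, 2)
```

The handler for sitemap[n].xml would then read only the byte range for its group and wrap those URLs in the sitemap XML, so each request touches one slice of the big file.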
Also realize that it's probably not that useful (with respect to SEO) to have such huge sitemaps.
Why can't you add an entry in your app.yaml to redirect where the files go? robots.txt should be at the root level, but I keep it in /img.
It is exactly the same to any crawler.