In a sitemap, is it advisable to include links to every page on the site, or only links to the pages that need it?
I'm in the process of creating a sitemap for my website. I'm doing this because I have a large number of pages that users can normally only reach via a search form.
I've created an automated method for pulling the links out of the database and compiling them into a sitemap. However, for all the pages that are regularly accessible and do not live in the database, I would have to go through manually and add them to the sitemap.
It strikes me that the regular pages are the ones ordinary crawlers find anyway, so manually adding those pages, and then making sure the sitemap stays up to date with any changes to them, seems like a hassle.
Is it bad to just leave those out, if they're already being indexed, and have my sitemap contain only my dynamic pages?
2 Answers
Google will crawl any URLs (as allowed by robots.txt) it discovers, even if they are not in the sitemap. So long as your static pages are all reachable from the other pages in your sitemap, it is fine to exclude them. However, there are other features of sitemap XML that may incentivize you to include static URLs in your sitemap (such as modification dates and priorities).
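For illustration, a single sitemap entry using those optional fields might look like the following (the URL and values here are placeholders, not taken from the question):

```xml
<url>
  <loc>https://example.com/about.html</loc>
  <lastmod>2024-01-15</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.5</priority>
</url>
```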
If you're willing to write a script to automatically generate a sitemap for database entries, then take it one step further and make your script also generate entries for static pages. This could be as simple as searching through the webroot and looking for *.html files. Or if you are using a framework, iterate over your framework's static routes.
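As a rough sketch of that idea, the script below walks a webroot for *.html files and emits a sitemap `<url>` entry for each, using the file's modification time as `<lastmod>`. The `public` directory and `https://example.com` base URL are assumptions for illustration, not details from the question; adjust them to your setup.

```python
#!/usr/bin/env python3
"""Sketch: extend a sitemap generator to cover static pages."""
from datetime import datetime, timezone
from pathlib import Path

WEBROOT = Path("public")          # hypothetical webroot directory
BASE_URL = "https://example.com"  # hypothetical site origin

def static_entries():
    """Yield <url> elements for every *.html file under the webroot."""
    for page in WEBROOT.rglob("*.html"):
        # Derive the public URL from the file's path relative to the webroot.
        loc = f"{BASE_URL}/{page.relative_to(WEBROOT).as_posix()}"
        # Use the file's modification time as <lastmod>.
        mtime = datetime.fromtimestamp(page.stat().st_mtime, tz=timezone.utc)
        yield (
            "  <url>\n"
            f"    <loc>{loc}</loc>\n"
            f"    <lastmod>{mtime.date().isoformat()}</lastmod>\n"
            "  </url>"
        )

if __name__ == "__main__":
    print('<?xml version="1.0" encoding="UTF-8"?>')
    print('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">')
    # The entries generated from the database would be printed here too,
    # alongside the static ones.
    for entry in static_entries():
        print(entry)
    print("</urlset>")
```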
Yes, I think it is not a good idea to leave them out. I think it would also be advisable to look for a way for your search pages to be found by a crawler without a sitemap. For example, you could add some kind of advanced search page where a user can select the search term in a form. Crawlers can also fill in those forms.
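As a hypothetical sketch of that suggestion (the /search endpoint and field names are made up for illustration): using a GET form means every submission resolves to a stable, plain URL such as /search?term=widgets, which gives simple crawlers a chance to discover those result pages.

```html
<!-- Hypothetical advanced-search page. Because the form uses
     method="get", each submission maps to a bookmarkable URL
     (e.g. /search?term=widgets) that crawlers may fetch. -->
<form action="/search" method="get">
  <label for="term">Search term:</label>
  <select id="term" name="term">
    <option value="widgets">Widgets</option>
    <option value="gadgets">Gadgets</option>
  </select>
  <button type="submit">Search</button>
</form>
```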