Specifying variables in robots.txt
My URL structure is set up in two parallel forms (both lead to the same place):
www.example.com/subname
www.example.com/123
The trouble is that the spiders are crawling into things like:
www.example.com/subname/default_media_function
www.example.com/subname/map_function
Note that "subname" stands for thousands of different pages on my site, all of which have these same functions.
These requests throw errors because those links are strictly for JSON or AJAX purposes and are not actual pages. I would like to block the spiders from accessing them, but how would I do that if the URL contains a variable?
Would this work in robots.txt?
Disallow: /map_function
Comments (1)
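You are going to have to do one of two things.
Robots look for robots.txt at the root level of the site. They also evaluate URL paths left to right, as prefixes, with no wildcards in the original robots.txt standard. A rule of Disallow: /map_function therefore only matches paths that begin with /map_function, not /subname/map_function.
So you will either need to put all the map_function endpoints under one location and exclude that, or exclude all of the locations.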
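The prefix matching described above can be checked with a small sketch using Python's standard urllib.robotparser, which follows plain prefix matching. The /ajax/ location is a hypothetical name chosen for illustration only; it is not part of the site in the question. Crawlers that implement wildcard extensions may behave differently.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The rule proposed in the question.
question_rules = """\
User-agent: *
Disallow: /map_function
"""

# Assumption: the JSON/AJAX endpoints are moved under one shared prefix.
# "/ajax/" is a hypothetical location, not something from the original site.
restructured_rules = """\
User-agent: *
Disallow: /ajax/
"""

base = "http://www.example.com"

# Only paths that start with /map_function are matched by the question's rule.
print(allowed(question_rules, base + "/map_function"))
print(allowed(question_rules, base + "/subname/map_function"))

# With a shared prefix, one rule covers every subname.
print(allowed(restructured_rules, base + "/ajax/subname/map_function"))
print(allowed(restructured_rules, base + "/subname"))
```

Under this parser the question's rule leaves /subname/map_function crawlable, while the single /ajax/ rule blocks the endpoints for every subname and leaves the normal pages alone.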