Why does a request throw an exception when crawled by Googlebot, but not when I paste the URL myself?

Posted on 2024-12-20 20:23:20

I've been getting a ton of these exceptions in my event log.

EVENT ID: 1309

Event code: 3005 
Event message: An unhandled exception has occurred. 
Event time: 12/12/2011 1:40:41 PM 
Event time (UTC): 12/12/2011 8:40:41 PM 
Event ID: f85f113a40d349f5a1fe9ef481038281 
Event sequence: 8993 
Event occurrence: 1463 
Event detail code: 0 

Application information: 
    Application domain: /LM/W3SVC/12/ROOT-1-129681577057031250 
    Trust level: Full 
    Application Virtual Path: / 
    Application Path: C:\inetpub\wwwroot\gouki\ 
    Machine name: GOUKIPRIME 

Process information: 
    Process ID: 7508 
    Process name: w3wp.exe 
    Account name: IIS APPPOOL\gouki 

Exception information: 
    Exception type: HttpException 
    Exception message: A potentially dangerous Request.Path value was detected from the client (?).
   at System.Web.HttpRequest.ValidateInputIfRequiredByConfig()
   at System.Web.HttpApplication.PipelineStepManager.ValidateHelper(HttpContext context)



Request information: 
    Request URL: http://gouki.com/Story/?page=8&orderby=views&tagged=&subject=&author=?page=10&orderby=views,views,views,&tagged=,,,,,,,,,,,,&subject=,,,,,,,,,,,,,,,,,,&author=,,,,,,,,,,,,,, 
    Request path: /Story/?page=8&orderby=views&tagged=&subject=&author= 
    User host address: 66.249.68.81 
    User:  
    Is authenticated: False 
    Authentication Type:  
    Thread account name: IIS APPPOOL\gouki 

Thread information: 
    Thread ID: 142 
    Thread account name: IIS APPPOOL\gouki 
    Is impersonating: False 
    Stack trace:    at System.Web.HttpRequest.ValidateInputIfRequiredByConfig()
   at System.Web.HttpApplication.PipelineStepManager.ValidateHelper(HttpContext context)


Custom event details: 

Connection: Keep-alive
Accept: */*
Accept-Encoding: gzip,deflate
From: googlebot(at)googlebot.com
Host: gouki.com
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

I'm not sure where Googlebot is picking up the malformed URL (I've tried to reproduce it on my site, to no avail), but what I'm more curious about is why this exception is logged when Googlebot requests the URL, yet when I copy/paste the URL myself (go on, try it) I get no error. Yes, the page is somewhat broken since the parameter values make no sense, and I can see why two question marks could cause issues, but no exception is thrown. I've tried changing my user agent to Googlebot's, and I still don't see the error.

For some reason ASP.NET MVC is seeing the first ? as part of the path rather than the start of the query string, but only when Googlebot requests the page.

Is there some sort of escaping going on here that I'm not seeing in the event log?

被你宠の有点坏 2024-12-27 20:23:20

Notice this:

Request path: /Story/?page=8&orderby=views&tagged=&subject=&author=

The server thinks that the query string parameters are part of the page name, which probably means that the first question mark was actually sent escaped as %3f, but is not shown that way in the error message. A question mark is valid as the separator before the query string, but not as part of the page name.
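To illustrate the difference, here is a minimal Python sketch (not ASP.NET code) that uses the standard urllib.parse module as a stand-in for the server's URL parsing. The helper name decoded_path and the shortened trailing query ?page=10 are assumptions made for the example; the point is only that an escaped %3f stays in the path and decodes to the same text the event log shows as the request path.

```python
from urllib.parse import urlsplit, unquote

# Query text taken from the logged request path.
QUERY = "page=8&orderby=views&tagged=&subject=&author="


def decoded_path(url: str) -> str:
    """Return the URL's path after percent-decoding (hypothetical helper)."""
    return unquote(urlsplit(url).path)


# Literal "?": the parser treats it as the start of the query string.
plain = "http://gouki.com/Story/?" + QUERY + "?page=10"

# Escaped "%3f": the same character is now ordinary path data, and the
# later literal "?" becomes the real query separator.
escaped = "http://gouki.com/Story/%3f" + QUERY + "?page=10"

print(decoded_path(plain))       # /Story/
print(decoded_path(escaped))     # /Story/?page=8&orderby=views&tagged=&subject=&author=
print(urlsplit(escaped).query)   # page=10
```

Pasting the logged URL into a browser sends the literal form, which would explain why the error does not reproduce that way.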

The bot has picked up the URL somewhere, and perhaps tried to fix it. Make sure that you have escaped the URLs properly, i.e. the & should be written as &amp; when the URL is in an attribute of an HTML element.
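As a rough sketch of that escaping step, the Python snippet below uses the standard html.escape function to encode a link before placing it in an href attribute. The link text "Page 8" and the variable names are assumptions for illustration; in an ASP.NET MVC view the framework's own HTML-encoding helpers would play this role.

```python
from html import escape

# Relative link as it might appear in the page's paging controls.
relative_link = "?page=8&orderby=views&tagged=&subject=&author="

# quote=True also escapes double and single quotes, which matters inside attributes.
anchor = '<a href="{}">Page 8</a>'.format(escape(relative_link, quote=True))

print(anchor)
# <a href="?page=8&amp;orderby=views&amp;tagged=&amp;subject=&amp;author=">Page 8</a>
```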

If you have a relative link like ?page=8&orderby=views&tagged=&subject=&author= in your page, the bot might try to make a complete URL by combining it with the current page URL, which would explain the double sets of query strings. This should normally work, but if there is some problem with the escaping of the URL, it might mess it up.
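The sketch below contrasts the two outcomes, using Python's urllib.parse.urljoin as an example of standard relative-reference resolution. The shortened link ?page=10&orderby=views is an assumption for the example, and the "naive" concatenation is only a guess at how a doubled query string like the logged one could be produced, not a statement of what Googlebot actually does.

```python
from urllib.parse import urljoin, urlsplit

base = "http://gouki.com/Story/?page=8&orderby=views&tagged=&subject=&author="
link = "?page=10&orderby=views"

# Standard resolution: a query-only reference replaces the base URL's query.
resolved = urljoin(base, link)

# Naive string concatenation: the old query is kept and a second "?" appears.
naive = base + link

print(resolved)  # http://gouki.com/Story/?page=10&orderby=views
print(naive)     # http://gouki.com/Story/?page=8&orderby=views&tagged=&subject=&author=?page=10&orderby=views
print(urlsplit(naive).query.count("?"))  # 1
```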
