Need help tracking down a sporadic "Unable to serialize the session state" server error

Posted 2024-09-27 04:43:42

We have been receiving reports of the following server error periodically from users.

[OutOfMemoryException: Exception of type System.OutOfMemoryException was thrown.]
[HttpException (0x80004005): Unable to serialize the session state. Please note that non-serializable objects or MarshalByRef objects are not permitted when session state mode is 'StateServer' or 'SQLServer'.]

Once the site is in a state where this error appears, it is hit or miss whether the errors are reproducible locally. If they are, we can usually reproduce them for a couple of minutes, but not on every page hit. The problem usually tapers off on its own and has typically resolved itself by the time we get back in contact with the users.

The Web Service has around 90-100 active connections during business hours. The only other site on this server is the staging version of this site, which gets hit very infrequently. The session state is stored on the same SQL Server instance as the application database, which is housed on a fairly large cluster of virtual machines. Neither the web server nor the SQL Server appears to be taxed (either processor- or memory-wise) while this is going on.

The distribution of which pages are erroring seems comparable to the normal traffic distribution across pages. There doesn't appear to be any pattern in the times of occurrence. We do have fewer errors on average on weekends (which correlates with normal site load), but even this is not consistent.

There also doesn't appear to be a correlation between the errors logged and any kind of logged performance monitor events. This includes an array of perfmon counters including:

.NET CLR Jit(w3wp)\Total # of IL Bytes Jitted  
.NET CLR Jit(w3wp)\IL Bytes Jitted / sec  
.NET CLR Jit(w3wp)\% Time in Jit  
.NET CLR Jit(w3wp)\# of Methods Jitted  
.NET CLR Jit(w3wp)\# of IL Bytes Jitted  
ASP.NET Apps v1.1.4322(__Total__)\Requests Failed  
ASP.NET Apps v1.1.4322(__Total__)\Errors Unhandled During Execution/Sec  
ASP.NET Apps v1.1.4322(__Total__)\Errors Unhandled During Execution  
ASP.NET Apps v1.1.4322(__Total__)\Cache Total Turnover Rate  
ASP.NET Apps v1.1.4322(__Total__)\Errors During Preprocessing  
ASP.NET Apps v1.1.4322(__Total__)\Errors During Execution  
ASP.NET Apps v1.1.4322(__Total__)\Requests Executing  
ASP.NET Apps v1.1.4322(__Total__)\Requests Total  
ASP.NET Apps v1.1.4322(__Total__)\Errors Total  
ASP.NET Apps v1.1.4322(__Total__)\Sessions Abandoned  
ASP.NET Apps v1.1.4322(__Total__)\Errors Total/Sec  
ASP.NET Apps v1.1.4322(__Total__)\Anonymous Requests/Sec  
ASP.NET Apps v1.1.4322(__Total__)\Requests/Sec  
ASP.NET Apps v1.1.4322(__Total__)\Session SQL Server connections total  
ASP.NET Apps v1.1.4322(__Total__)\Cache Total Hit Ratio  
ASP.NET v1.1.4322\Requests Current  
ASP.NET v1.1.4322\Request Execution Time  
Memory\Pages/sec  
Bytes Total/sec  
PhysicalDisk(_Total)\Avg. Disk Queue Length  
Processor(_Total)\% Processor Time  
Web Service Cache\File Cache Hits %  
Web Service Cache\File Cache Misses  
Web Service Cache\File Cache Hits  
Web Service(_Total)\Current Connections  
Web Service(_Total)\Post Requests/sec
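
For anyone who wants to run the same kind of check, here is a minimal Python sketch of comparing a counter's values just before each error against its overall baseline. This is not our actual tooling: the function names, the 5-minute window and the sample data are all made up for illustration, and it assumes the perfmon log has already been parsed into (timestamp, value) pairs.

```python
from datetime import datetime, timedelta
from statistics import mean


def window_mean(samples, end, window):
    """Mean of the counter values sampled in the interval (end - window, end].

    Returns None when no sample falls inside the window.
    """
    values = [v for t, v in samples if end - window < t <= end]
    return mean(values) if values else None


def compare_to_baseline(samples, error_times, window=timedelta(minutes=5)):
    """Return (overall mean, list of means in the window before each error)."""
    baseline = mean(v for _, v in samples)
    pre_error = [window_mean(samples, t, window) for t in error_times]
    return baseline, pre_error


# Hypothetical data: one sample per minute for 30 minutes, two logged errors.
start = datetime(2000, 1, 1, 9, 0)
samples = [(start + timedelta(minutes=i), float(i % 3)) for i in range(30)]
errors = [start + timedelta(minutes=10), start + timedelta(minutes=25)]

baseline, pre_error = compare_to_baseline(samples, errors)
print("baseline:", baseline)
print("before each error:", pre_error)
```

If the pre-error means sit close to the baseline for every counter, as they do in this made-up data, that matches the "no correlation" picture described above.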

The only pattern I can see in the logs doesn't correlate with the occurrence of these errors. Looking at the perfmon logs, the "Total # of IL Bytes Jitted", "IL Bytes Jitted / sec", "% Time in Jit", "# of Methods Jitted", and "# of IL Bytes Jitted" counters for the staging site (which shouldn't be getting any traffic) stop reporting data for a 20-50 minute period. Immediately afterward, the main site shows a spike in "IL Bytes Jitted / sec" and a jump in "% Time in Jit" to as much as 99%, lasting 2-20 minutes.
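
To make that pattern concrete, here is a rough Python sketch of how such gaps and the spike that follows them could be picked out of parsed counter samples. The helper names, the one-minute sampling interval, the 20-minute look-ahead and the data are all assumptions for illustration only, not taken from our logs.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval, tolerance=2.0):
    """Return (last_sample_before_gap, first_sample_after_gap) pairs.

    A gap is any pair of consecutive samples that are more than
    tolerance * expected_interval apart.
    """
    limit = expected_interval * tolerance
    return [(a, b) for a, b in zip(timestamps, timestamps[1:]) if b - a > limit]


def peak_after(samples, when, window):
    """Largest counter value in [when, when + window], or None if no samples."""
    values = [v for t, v in samples if when <= t <= when + window]
    return max(values) if values else None


# Hypothetical staging-site samples: every minute, silent from 09:09 to 09:40.
start = datetime(2000, 1, 1, 9, 0)
minutes = list(range(10)) + list(range(40, 45))
staging_times = [start + timedelta(minutes=m) for m in minutes]

# Hypothetical main-site "% Time in Jit": flat at 5%, with a jump to 99% at 09:41.
main_samples = [
    (start + timedelta(minutes=m), 99.0 if m == 41 else 5.0) for m in range(45)
]

gaps = find_gaps(staging_times, timedelta(minutes=1))
for gap_start, gap_end in gaps:
    peak = peak_after(main_samples, gap_end, timedelta(minutes=20))
    print(gap_start, "->", gap_end, "peak % Time in Jit afterwards:", peak)
```

Running the same check against the error timestamps as well would show whether any of the errors fall inside or just after these gaps.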

If anyone has any ideas on what could be causing this, or has had experience with a similar issue, I would be grateful for any input.

Thanks!
