SQL Server 锁定/挂起问题
我在 Windows Server 2008 R2 上使用 SQL Server 2008,全部加速。
我偶尔会遇到 SQL Server 挂起的问题,并且我们的实时服务器上的 CPU 使用率为 100%。发生这种情况时,SQL Sever 上的所有等待时间似乎都给了 SOS_SCHEDULER_YIELD。
这是导致挂起的存储过程。我添加了“WITH (NOLOCK)”,试图解决似乎是锁定的问题。
ALTER PROCEDURE [dbo].[MostPopularRead]
AS
BEGIN
SET NOCOUNT ON;
SELECT
c.ForeignId , ct.ContentSource as ContentSource
, sum(ch.HitCount * hw.Weight) as Popularity
, (sum(ch.HitCount * hw.Weight) * 100) / @Total as Percent
, @Total as TotalHits
from
ContentHit ch WITH (NOLOCK)
join [Content] c WITH (NOLOCK) on ch.ContentId = c.ContentId
join HitWeight hw WITH (NOLOCK) on ch.HitWeightId = hw.HitWeightId
join ContentType ct WITH (NOLOCK) on c.ContentTypeId = ct.ContentTypeId
where
ch.CreatedDate between @Then and @Now
group by
c.ForeignId , ct.ContentSource
order by
sum(ch.HitCount * hw.HitWeightMultiplier) desc
END
存储的过程从表“ContentHit”中读取,该表跟踪网站上的内容何时被单击(它被频繁点击 - 每分钟 4 到 20 次点击)。所以很明显这张表就是问题的根源。有一个存储过程被调用来将命中轨道添加到 ContentHit 表中,它非常简单,它只是根据传入的参数构建一个字符串,其中涉及从一些查找表中进行一些选择,然后是主要插入
BEGIN TRAN
insert into [ContentHit]
(ContentId, HitCount, HitWeightId, ContentHitComment)
values
(@ContentId, isnull(@HitCount,1), isnull(@HitWeightId,1), @ContentHitComment)
COMMIT TRAN
: ContentHit 表在其 ID 列上有一个聚集索引,并且我在 CreatedDate 上添加了另一个索引,因为该索引在选择中使用。
当我分析问题时,我看到存储过程执行了整整 30 秒,然后发生 SQL 超时异常。如果它有所不同的话,使用它的 Web 应用程序是 ASP.NET,并且我使用 Subsonic (3) 来执行这些存储过程。
有人可以告诉我如何最好地解决这个问题吗?我不关心读取脏数据...
编辑: MostPopularRead 存储过程的调用频率非常低 - 它在站点的主页上调用,但结果会缓存一天。我看到的事件模式是当我清除缓存时,主站点收到多个请求,并且它们都到达存储过程,因为它尚未被缓存。然后SQL Server达到最大,只能通过重新启动SQL Server进程来解决。当我这样做时,通常 proc 会正常执行(大约 200 毫秒)并将数据放回到缓存中。
编辑2: 我检查了执行计划,查询看起来很合理。正如我之前所说,当它运行时,只需要大约 200 毫秒即可执行。我已将 MAXDOP 1 添加到 select 语句中,以强制它仅使用一个 CPU 核心,但我仍然看到该问题。当我查看等待时间时,我发现 XE_DISPATCHER_WAIT、ONDEMAND_TASK_QUEUE、BROKER_TRANSMITTER、KSOURCE_WAKEUP 和 BROKER_EVENTHANDLER 占用了大量等待时间。
编辑3: 我之前认为这与我们的 ORM Subsonic 有关,但切换到 ADO.NET 后,错误仍然存在。
I'm using SQL Server 2008 on Windows Server 2008 R2, all sp'd up.
I'm getting occasional issues with SQL Server hanging with the CPU usage on 100% on our live server. It seems all the wait time on SQL Sever when this happens is given to SOS_SCHEDULER_YIELD.
Here is the Stored Proc that causes the hang. I've added the "WITH (NOLOCK)" in an attempt to fix what seems to be a locking issue.
ALTER PROCEDURE [dbo].[MostPopularRead]
AS
BEGIN
SET NOCOUNT ON;
SELECT
c.ForeignId , ct.ContentSource as ContentSource
, sum(ch.HitCount * hw.Weight) as Popularity
, (sum(ch.HitCount * hw.Weight) * 100) / @Total as Percent
, @Total as TotalHits
from
ContentHit ch WITH (NOLOCK)
join [Content] c WITH (NOLOCK) on ch.ContentId = c.ContentId
join HitWeight hw WITH (NOLOCK) on ch.HitWeightId = hw.HitWeightId
join ContentType ct WITH (NOLOCK) on c.ContentTypeId = ct.ContentTypeId
where
ch.CreatedDate between @Then and @Now
group by
c.ForeignId , ct.ContentSource
order by
sum(ch.HitCount * hw.HitWeightMultiplier) desc
END
The stored proc reads from the table "ContentHit", which is a table that tracks when content on the site is clicked (it gets hit quite frequently - anything from 4 to 20 hits a minute). So its pretty clear that this table is the source of the problem. There is a stored proc that is called to add hit tracks to the ContentHit table, its pretty trivial, it just builds up a string from the params passed in, which involves a few selects from some lookup tables, followed by the main insert:
BEGIN TRAN
insert into [ContentHit]
(ContentId, HitCount, HitWeightId, ContentHitComment)
values
(@ContentId, isnull(@HitCount,1), isnull(@HitWeightId,1), @ContentHitComment)
COMMIT TRAN
The ContentHit table has a clustered index on its ID column, and I've added another index on CreatedDate since that is used in the select.
When I profile the issue, I see the Stored proc executes for exactly 30 seconds, then the SQL timeout exception occurs. If it makes a difference the web application using it is ASP.NET, and I'm using Subsonic (3) to execute these stored procs.
Can someone please advise how best I can solve this problem? I don't care about reading dirty data...
EDIT:
The MostPopularRead stored proc is called very infrequently - its called on the home page of the site, but the results are cached for a day. The pattern of events that I am seeing is when I clear the cache, multiple requests come in for the home site, and they all hit the stored proc because it hasn't yet been cached. SQL Server then maxes out, and can only be resolved by restarting the sql server process. When I do this, usually the proc will execute OK (in about 200 ms) and put the data back in the cache.
EDIT 2:
I've checked the execution plan, and the query looks quite sound. As I said earlier when it does run it only takes around 200ms to execute. I've added MAXDOP 1 to the select statement to force it to use only one CPU core, but I still see the issue. When I look at the wait times I see that XE_DISPATCHER_WAIT, ONDEMAND_TASK_QUEUE, BROKER_TRANSMITTER, KSOURCE_WAKEUP and BROKER_EVENTHANDLER are taking up a massive amount of wait time.
EDIT 3:
I previously thought that this was related to Subsonic, our ORM, but having switched to ADO.NET, the erros is still live.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
删除 NOLOCK 提示。
在 SSMS 中打开查询,运行 SET STATISTICSIO ON 并在过程中运行查询。让它完成并在此处发布 IO 统计消息。然后发布表定义及其上定义的所有索引。然后有人将能够回复您所需的正确索引。
与所有 SQL 性能问题一样,如果没有完整的模式定义,查询文本在很大程度上是无关紧要的。
估计覆盖索引为:
更新
XE_DISPATCHER_WAIT
、ONDEMAND_TASK_QUEUE
、BROKER_TRANSMITTER
、KSOURCE_WAKEUP
和BROKER_EVENTHANDLER
:您可以安全地忽略所有这些等待。它们之所以出现,是因为它们代表已停放并等待分派 XEvent、Service Broker 或内部 SQL 线程池工作项的线程。由于他们大部分时间都花在停车和等待上,因此他们的等待时间不切实际。忽略他们。Remove the NOLOCK hint.
Open a query in SSMS, run
SET STATISTICSIO ON
and run the query in the procedure. Let it finish and post here the IO stats messages. Then post the table definitions and all indexes defined on them. Then somebody will be able to reply with the proper indexes you need.As with all SQL performance problem, the text of the query is largely irrelevant without complete schema definition.
A guesstimate covering index would be:
Update
XE_DISPATCHER_WAIT
,ONDEMAND_TASK_QUEUE
,BROKER_TRANSMITTER
,KSOURCE_WAKEUP
andBROKER_EVENTHANDLER
: you can safely ignore all these waits. They show up because they represent threads parked and waiting to dispatch XEvents, Service Broker or internal SQL thread pool work items. As they spend most of their time parked and waiting, they get accounted for unrealistic wait times. Ignore them.如果您认为
ContentHit
是问题的根源,您可以添加 覆盖索引查看查询计划(如果您想确定查询中的瓶颈)。
If you believe
ContentHit
to be the source of your problem, you could add a Covering IndexTake a look at the Query Plan if you want to be certain about the bottleneck in your query.
默认情况下,sql server 使用所有核心/CPU 进行所有查询(最大 DoP 设置>高级属性,DoP= 并行度),即使只有一个核心实际上正在等待某些 I/O,这也可能导致 100% CPU。
如果你搜索网络或这个网站,你会发现资源解释得比我更好(比如监控你的 I/o,尽管你看到了 CPU 限制的问题)。
在一台服务器上,我们无法通过锁定所有资源 (CPU) 的错误查询来更改应用程序,但通过将 DoP 设置为核心数量的一半,我们设法避免服务器“停止”。在我们的例子中,并行性较低对查询的影响可以忽略不计。
--
多姆
By default settings sql server uses all the core/cpu for all queries (max DoP setting> advanced property, DoP= Degree of Parallelism), which can lead to 100% CPU even if only one core is actually waiting for some I/O.
If you search the net or this site you will find resource explaining it better than me (like monitoring your I/o despite you see a CPU-bound problem).
On one server we couldn't change the application with a bad query that locked down all resources (CPU) but by setting DoP to the half of the number of core we managed to avoid that the server get "stopped". The effect on the queries being less parallel was negligible in our case.
--
Dom
感谢所有发帖者,我得到了一些很棒的 SQL Server 性能调整技巧。
最终我们用完了时间来解决这个谜团——我们找到了一种更有效的方法来收集这些信息并将其缓存在数据库中,所以这为我们解决了问题。
Thanks to all who posted, I got some great SQL Server perf tuning tips.
In the end we ran out time to resolve this mystery - we found a more effecient way to collect this information and cache it in the database, so this solved the problem for us.
问题可能是并发性,而不是锁定。 SOS_SCHEDULER_YIELD 当任务自愿让出调度程序以供其他任务执行时发生。在此等待期间,任务正在等待其量程被更新。
[MostPopularRead] SP 多久调用一次以及执行需要多长时间?
查询中的聚合可能相当消耗 CPU 资源,尤其是在存在大量数据和/或无效索引的情况下。因此,您最终可能会面临较高的 CPU 压力 - 基本上,对 CPU 时间的需求过高。
我会考虑以下内容:
检查 CPU 100% 繁忙时正在执行哪些其他查询?查看 sys.dm_os_waiting_tasks、sys.dm_os_tasks、sys.dm_exec_requests。
查看[MostPopularRead]的查询计划,尝试优化查询。很多时候,无效的查询是性能问题的根本原因,而查询优化比其他性能改进技术要简单得多。
如果查询计划是并行的,并且查询经常被多个客户端同时调用,则使用 MAXDOP=1 提示强制使用单线程计划可能会有所帮助(并行计划的大量使用通常由 SOS_SCHEDULER_YIELD 和 CXPACKET 等待指示)。< /p>
另外,请查看本文: 通过等待统计信息进行性能调整。它很好地总结了不同的等待类型及其对性能的影响。
PS 在查询之前使用 SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED 比向每个表添加 (nolock) 更容易。
The issue is likely concurrency, not locking. SOS_SCHEDULER_YIELD occurs when a task voluntarily yields the scheduler for other tasks to execute. During this wait the task is waiting for its quantum to be renewed.
How often is [MostPopularRead] SP called and how long does it take to execute?
The aggregation in your query might be rather CPU-intensive, especially if there are lots of data and/or ineffective indexes. So, you might end up with high CPU pressure - basically, a demand for CPU time is too high.
I'd consider the following:
Check what other queries are executing while CPU is 100% busy? Look at sys.dm_os_waiting_tasks, sys.dm_os_tasks, sys.dm_exec_requests.
Look at the query plan of [MostPopularRead], try to optimize the query. Quite often an ineffective query is the root cause of a performance problem, and query optimization is much more straightforward than other performance improvement techniques.
If the query plan is parallel and the query is often called by multiple clients simultaneously, forcing a single-thread plan with MAXDOP=1 hint might help (abundant use of parallel plans is usually indicated by SOS_SCHEDULER_YIELD and CXPACKET waits).
Also, have a look at this paper: Performance tuning with wait statistics. It gives a pretty good summary of different wait types and their impact on performance.
P.S. It is easier to use SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED before a query instead of adding (nolock) to each table.