Multi-threading & database record locks
Need help big time ....
I need to create a .NET application that will perform some bulk operations on a table of around 2,000,000 records. There is a window of opportunity in which the application should run and try to process as many rows as it can. I am thinking that if I have multiple threads of the app each take 2,000 records at a time and process them, it should be able to get through more. However, that will be pretty costly on the database server. From what I am told, the db server is a beefy machine and should be able to handle the pressure.
Also, by only taking 2000 rows at a time, should the application terminate in the middle of its processing, it would know where to pick up again.
So, I guess what I am asking is ...
1) How can I have the app pick up rows and lock them so that they are not assigned to another thread?
2) What kind of intelligence can be programmed into the app that will allow for it to pick up processing where it last left off from?
Thanks
KP
Rather than reinvent the wheel, you should do this using SQL Server Integration Services (SSIS). It is highly optimized for these scenarios, especially in the 2008 release.
I agree with John that SSIS has a lot of built in intelligence for such scenarios and is probably the best bet to invest your time into.
For the record, such problems are approached by partitioning your data. I'm not talking about physical storage partitioning (i.e. table partitioning), but logical, processing partitioning. You split your 2 million records into N partitions, based on whatever criteria you have that can be exploited at the data access level, e.g. an indexed column, then allocate N processors that each start churning on their own partition. The idea is to avoid the processors overlapping in trying to access the same rows. 'Processors' can be threads, or better still ThreadPool queued-up worker items that use async database access methods.
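For example, when the table has an indexed key, each processor can be handed a disjoint slice of the key range (table and column names here are assumed, for illustration only):

```sql
-- Sketch: processor i works only the interval [@rangeStart, @rangeEnd)
-- of the indexed key, so processors never touch each other's rows.
DECLARE @rangeStart int = 0;        -- this processor's slice, assigned up front
DECLARE @rangeEnd   int = 500000;

SELECT Id, Payload
FROM dbo.WorkTable
WHERE Id >= @rangeStart AND Id < @rangeEnd
  AND Processed = 0;
```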
The big problem is that many times you don't have a suitable partitioning key. In such cases you can do an ad-hoc partitioning like this:
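A sketch of the idea, with assumed table and column names (`dbo.WorkTable`, an `Id` key, a `Processed` flag):

```sql
-- Each processor runs this to claim its own distinct batch.
DECLARE @batchSize int = 2000;

BEGIN TRANSACTION;

-- updlock takes update locks on the rows we select;
-- readpast skips rows another processor has already locked.
SELECT TOP (@batchSize) Id, Payload
FROM dbo.WorkTable WITH (ROWLOCK, UPDLOCK, READPAST)
WHERE Processed = 0;

-- ... process the rows, set Processed = 1 for them, then ...
COMMIT;
```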
The trick is the locking hints used in the select: by forcing an updlock, the records are locked for processing by the current processor. By adding the readpast hint, each processor will skip records that are already locked by other processors. This way each processor gets its own @batchSize batch of records to process, whatever the processing is.
It is important to understand that all these comments apply to processing that involves something outside the database, like doing a web service call, printing a paper slip or anything similar. If the processing is all in the database, then you should just express it as a single T-SQL update and let the query optimizer use parallel queries as it sees fit.
I'd do this:
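One way to set this up (a sketch only; the schema names are assumed): drive both dispatch and restart from a status column on the table.

```sql
-- 0 = pending, 1 = claimed by a worker, 2 = done
ALTER TABLE dbo.WorkTable ADD Status tinyint NOT NULL
    CONSTRAINT DF_WorkTable_Status DEFAULT 0;

-- On startup, release anything a previous run left half-claimed,
-- so those rows get picked up again:
UPDATE dbo.WorkTable SET Status = 0 WHERE Status = 1;
```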
Processing threads:
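Each thread could loop along these lines (a sketch against the assumed `Status` column; the claim is atomic, so threads never grab the same rows):

```sql
DECLARE @batchSize int = 2000;
DECLARE @claimed TABLE (Id int PRIMARY KEY);

WHILE 1 = 1
BEGIN
    DELETE FROM @claimed;

    -- Atomically claim up to @batchSize pending rows; readpast skips
    -- rows other threads have locked, so each batch is distinct.
    UPDATE TOP (@batchSize) w
    SET Status = 1
    OUTPUT inserted.Id INTO @claimed (Id)
    FROM dbo.WorkTable w WITH (ROWLOCK, UPDLOCK, READPAST)
    WHERE Status = 0;

    IF @@ROWCOUNT = 0 BREAK;  -- nothing left to claim

    -- ... do the external processing for the ids in @claimed ...

    UPDATE w SET Status = 2
    FROM dbo.WorkTable w JOIN @claimed c ON c.Id = w.Id;
END
```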
This should give you high throughput and a safe restart, provided it's OK to have some of your table unprocessed at any one time. If that's not the case, then restarting is moot anyway.