Handling threads for batch jobs in a servlet environment
I have a Spring MVC, Hibernate (Postgres 9 DB) web app. An admin user can send a request to process nearly 200,000 records (each record is collected from various tables via joins). Such an operation is requested on a weekly or monthly basis (or whenever the data reaches a threshold of around 100,000 to 200,000 records). On the database end, I am already batching correctly.
PROBLEM: Such a long-running request holds up the server thread, and that causes normal users to suffer.
REQUIREMENT: The high response time of this request is not an issue. What is required is that other users do not suffer because of this time-consuming process.
MY SOLUTION:
Implement a thread pool using Spring's TaskExecutor abstraction. I can initialize the pool with, say, 5 or 6 threads, break the 200,000 records into smaller chunks of, say, 1,000 each, and queue those chunks. To give normal users faster DB access, maybe I can also make every runnable sleep for 2 or 3 seconds.
The advantage I see in this approach: instead of executing one huge DB-heavy request in one go, we get an asynchronous design spread over a longer time, which behaves like multiple normal user requests.
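A minimal sketch of the chunk-and-throttle idea described above, using plain java.util.concurrent rather than Spring's TaskExecutor so that it stays self-contained. The class name ChunkedBatch, its method names, and the use of integer record ids are all assumptions made for illustration; the per-chunk database work is only a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: splits the work into chunks and runs them on a small dedicated pool.
final class ChunkedBatch {

    /** Splits items into consecutive chunks of at most chunkSize elements. */
    static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < items.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, items.size());
            chunks.add(new ArrayList<>(items.subList(from, to)));
        }
        return chunks;
    }

    /** Queues one task per chunk and returns immediately; each task reports how many records it handled. */
    static List<Future<Integer>> submitAll(ExecutorService pool, List<Integer> recordIds,
                                           int chunkSize, long pauseMillis) {
        List<Future<Integer>> futures = new ArrayList<>();
        for (List<Integer> chunk : partition(recordIds, chunkSize)) {
            futures.add(pool.submit(() -> {
                int handled = chunk.size();  // placeholder for the real per-chunk DB work
                Thread.sleep(pauseMillis);   // pause so normal requests get DB time before the next chunk
                return handled;
            }));
        }
        return futures;
    }

    /** Blocks until every chunk has finished and returns the total number of records handled. */
    static int awaitTotal(List<Future<Integer>> futures) {
        int total = 0;
        try {
            for (Future<Integer> future : futures) {
                total += future.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for chunks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a chunk failed", e.getCause());
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 5000; i++) {
            ids.add(i);
        }
        ExecutorService pool = Executors.newFixedThreadPool(5);
        try {
            List<Future<Integer>> futures = submitAll(pool, ids, 1000, 10L);
            System.out.println("chunks=" + futures.size() + ", records=" + awaitTotal(futures));
        } finally {
            pool.shutdown();
        }
    }
}
```

In the web app, the controller would presumably call only submitAll and return right away; awaitTotal is here just so the sketch can show that every record was handled. Note that the sleep holds a pool thread while it waits, which is the intended throttle but also caps throughput at the pool size.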
Can some experienced people please give their opinion on this?
I have also read about implementing the same behaviour with message-oriented middleware like JMS/AMQP, or with Quartz scheduling. But frankly speaking, I think internally they are going to do the same thing, i.e. create a thread pool and queue the jobs. So why not go with Spring's task executors instead of adding a completely new piece of infrastructure to my web app just for this feature?
Please share your views on this, and let me know if there are better ways to do it.
Once again: the time to completely process all the records is not a concern. What is required is that normal users accessing the web app during that time should not suffer in any way.
You can parallelize the tasks and wait for all of them to finish before returning the call. For this, you want to use ExecutorCompletionService, which has been part of the Java standard library since Java 5.
In short, you use your container's service locator to create an instance of ExecutorCompletionService.
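As a rough illustration of the ExecutorCompletionService idea, here is a minimal sketch. It creates its own fixed pool instead of going through a container's service locator, and the class name CompletionServiceDemo, the method sumChunkSizes, and the "work" of counting a chunk's records are assumptions for the example only.

```java
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo class: one task per chunk, results collected as they complete.
final class CompletionServiceDemo {

    /** Submits one task per chunk, then waits for all of them and sums their results. */
    static int sumChunkSizes(List<List<Integer>> chunks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CompletionService<Integer> completion = new ExecutorCompletionService<>(pool);
        try {
            for (List<Integer> chunk : chunks) {
                // Stand-in for real work: each task just reports its chunk size.
                completion.submit(() -> chunk.size());
            }
            int total = 0;
            for (int i = 0; i < chunks.size(); i++) {
                // take() blocks until the next task finishes, in completion order.
                total += completion.take().get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for chunks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a chunk failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The calling thread still blocks until the last chunk is done, so this variant speeds the job up but does not free the request thread.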
If you do not want to wait, you can process the jobs in the background without blocking the current thread, but then you will need some mechanism to inform the client when the job has finished. That can be done through JMS, or, if you have an Ajax client, it can poll for updates.
Quartz also has a job scheduling mechanism, but Java provides a standard way.
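One way the background-plus-polling variant could look, as a sketch only: the request handler starts the job, hands a job id back to the client, and a second endpoint reports the status when the Ajax client polls. The JobRegistry class, its Status enum, and its method names are hypothetical, and a real application would also need to evict finished jobs from the map.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;

// Hypothetical registry: start a job in the background, look up its status by id later.
final class JobRegistry {

    enum Status { UNKNOWN, RUNNING, DONE, FAILED }

    private final Map<UUID, CompletableFuture<Void>> jobs = new ConcurrentHashMap<>();
    private final Executor executor;

    JobRegistry(Executor executor) {
        this.executor = executor;
    }

    /** Hands the job to the executor and returns an id the client can poll with. */
    UUID start(Runnable job) {
        UUID id = UUID.randomUUID();
        jobs.put(id, CompletableFuture.runAsync(job, executor));
        return id;
    }

    /** Reports the current state of a job without blocking. */
    Status status(UUID id) {
        CompletableFuture<Void> future = jobs.get(id);
        if (future == null) {
            return Status.UNKNOWN;
        }
        if (!future.isDone()) {
            return Status.RUNNING;
        }
        return future.isCompletedExceptionally() ? Status.FAILED : Status.DONE;
    }
}
```

The executor passed in would typically be a small dedicated pool, so the batch never competes with the container's request threads.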
EDIT:
I might have misunderstood the question. If you do not want a faster response but rather want to throttle the CPU, use this approach.
You can write an inner class, say PollingThread, and define the batches (each holding the java.util.UUID of every job in it) and the number of PollingThreads in the outer class. The threads keep running forever, and the setup can be tuned to keep your CPUs free to handle other requests.
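The code for this approach is not part of the text, so the following is only a guess at its shape, not the original: an outer class owns a queue of batches (lists of java.util.UUID job ids) and a configurable number of PollingThread workers that loop forever, pausing after each batch. The outer class name ThrottledProcessor, the pause parameter, and the counter are assumptions.

```java
import java.util.List;
import java.util.UUID;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical outer class: holds the batches and decides how many polling threads exist.
final class ThrottledProcessor {

    private final BlockingQueue<List<UUID>> batches = new LinkedBlockingQueue<>();
    private final AtomicInteger processedJobs = new AtomicInteger();
    private final int pollingThreads;
    private final long pauseMillis;

    ThrottledProcessor(int pollingThreads, long pauseMillis) {
        this.pollingThreads = pollingThreads;
        this.pauseMillis = pauseMillis;
    }

    /** Starts the polling threads; they run until interrupted or the JVM exits. */
    void start() {
        for (int i = 0; i < pollingThreads; i++) {
            new PollingThread("polling-" + i).start();
        }
    }

    /** Queues a batch of job ids for background processing. */
    void submit(List<UUID> batch) {
        batches.add(batch);
    }

    int processedCount() {
        return processedJobs.get();
    }

    private final class PollingThread extends Thread {

        PollingThread(String name) {
            super(name);
            setDaemon(true); // do not keep the JVM alive just for the pollers
        }

        @Override
        public void run() {
            try {
                while (true) {
                    List<UUID> batch = batches.take(); // blocks while there is nothing to do
                    for (UUID jobId : batch) {
                        // Placeholder for the real work on the job identified by jobId.
                        processedJobs.incrementAndGet();
                    }
                    Thread.sleep(pauseMillis); // tuning knob: longer pause, less CPU and DB pressure
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // treat interruption as a stop signal
            }
        }
    }
}
```

The two tuning knobs are the number of polling threads and the pause between batches; lowering the first or raising the second leaves more capacity for normal requests.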
Huge DB operations are usually triggered in the small hours, when user traffic is low (say, between 1 AM and 2 AM). Once you have found that window for your app, you can simply schedule a job to run at that time. Quartz can come in handy here, with its time-based triggers. (Note: manually triggering a job is also possible.)
The processed result can then be stored in separate tables (I'll refer to them as result tables). Later, when a user wants this result, the DB operations run against these result tables, which hold few records and involve hardly any joins.
Quartz.jar is about 350 KB, so adding this dependency shouldn't be a problem. Also note that there's no reason this needs to live in the web app. The few classes that do the ETL could be placed in a standalone module. The request from the web app then only needs to fetch from the result tables.
All that aside, if you already have a master-slave DB setup (discuss that with your DBA), you could run the huge DB operations against the slave rather than the master, which normal users would be pointed to.
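To make the time-based trigger concrete without pulling in Quartz, here is a sketch that uses only the JDK's ScheduledExecutorService. It is a stand-in for the Quartz trigger the answer recommends, not Quartz's API, and the class name NightlySchedule and its methods are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs a job once a day at a fixed local time.
final class NightlySchedule {

    /** Time from now until the next occurrence of runAt that is strictly after now. */
    static Duration delayUntilNext(LocalDateTime now, LocalTime runAt) {
        LocalDateTime next = now.toLocalDate().atTime(runAt);
        if (!next.isAfter(now)) {
            next = next.plusDays(1); // today's slot has passed, use tomorrow's
        }
        return Duration.between(now, next);
    }

    /** Schedules the job for the next runAt and then every 24 hours after that. */
    static ScheduledFuture<?> scheduleNightly(ScheduledExecutorService scheduler,
                                              Runnable job, LocalTime runAt) {
        long initialDelayMillis = delayUntilNext(LocalDateTime.now(), runAt).toMillis();
        return scheduler.scheduleAtFixedRate(
                job, initialDelayMillis, TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
    }
}
```

A fixed 24-hour period drifts by an hour across daylight-saving changes and the schedule is lost on restart, which are two of the reasons a dedicated scheduler may still be the better fit for a job like this.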