Real-time batch data processing
I am tasked with optimizing the performance of a linear data processing routine. Here's an overview of what's already in place:
Data comes in on UDP ports. We have multiple listeners, each listening on a different port and writing raw data to a SQL Server database (let's call the table RawData). Then we have multiple instances of a single-threaded linear application grabbing raw data from the RawData table and processing individual data rows. Processing means that the raw data is compared to previously received data for the given entity, calculations are done to determine the number of different readings, a couple of web services are called for each individual data row, and finally a new record is added to the ProcessedData table for each data row. The corresponding entity record is also updated in another table.
The way I see the problem, it can be broken down into smaller parts, and I could use the Producer/Consumer pattern for data processing:
A single producer thread populates a shared (blocking) queue, and multiple consumers grab data rows from the queue and process them in parallel. When the consumers are done, they put the processed data into another shared queue, which is read by a single consumer thread that does a SqlBulkCopy to insert the new records. Alongside this, another shared queue will store entity info for updates, and yet another consumer will grab that information and perform the entity updates.
The question is: even though it seems straightforward, it looks to me like a cumbersome approach. I feel there's a better way of doing what I'm looking for. Any suggestions on implementing the above Producer/Consumer pattern? Or should I look for a different design pattern for my problem?
Thanks in advance
Comments (1)
Your proposed solution sounds reasonable, and I don't view it as cumbersome at all. It's simple to understand, simple to implement, effective, and efficient. It also allows you to tune the number of producers and consumers to achieve the best performance. Decomposition into smaller parts with limited communication among the parts is a very good thing.
So what you have is multiple threads (producers) reading data from UDP and storing those items in a shared queue. Call it the RawData queue. Multiple consumers read from that queue, process items, and place the results into another shared queue. Call it the ProcessedData queue. Finally, you have a single thread that reads the ProcessedData queue and stores items in the database. The .NET BlockingCollection is perfect for this. This might be of some help: Question on C# threading with RFID
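To make the shape of that pipeline concrete, here is a minimal sketch of the three stages built on BlockingCollection. It is an illustration under assumptions, not a drop-in implementation: the RawRow and ProcessedRow types, the PipelineSketch class, the Process stub, and the queue capacity of 1000 are all hypothetical names and values, and the database write is replaced by collecting batches in memory. It assumes C# 9 or later (records, using declarations).

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical row types; the real RawData/ProcessedData schemas are not given in the thread.
public record RawRow(int EntityId, double Reading);
public record ProcessedRow(int EntityId, double Value);

public static class PipelineSketch
{
    // Stand-in for the real per-row work (comparison with earlier data, web service calls).
    private static ProcessedRow Process(RawRow row) =>
        new ProcessedRow(row.EntityId, row.Reading * 2);

    // Runs producer -> N consumers -> single writer and returns the batches
    // that the writer would have handed to the database.
    public static List<List<ProcessedRow>> Run(
        IEnumerable<RawRow> source, int consumerCount, int batchSize)
    {
        // Bounded queues give back-pressure: Add blocks while a queue is full.
        using var rawQueue = new BlockingCollection<RawRow>(boundedCapacity: 1000);
        using var processedQueue = new BlockingCollection<ProcessedRow>(boundedCapacity: 1000);
        var batches = new List<List<ProcessedRow>>();

        // Stage 1: one producer fills the raw queue, then marks it complete.
        var producer = Task.Run(() =>
        {
            try
            {
                foreach (var row in source) rawQueue.Add(row);
            }
            finally
            {
                rawQueue.CompleteAdding(); // lets the consumers' loops end
            }
        });

        // Stage 2: several consumers process rows in parallel.
        Task[] consumers = Enumerable.Range(0, consumerCount)
            .Select(i => Task.Run(() =>
            {
                foreach (var row in rawQueue.GetConsumingEnumerable())
                    processedQueue.Add(Process(row));
            }))
            .ToArray();

        // Stage 3: a single writer groups processed rows into batches.
        var writer = Task.Run(() =>
        {
            var batch = new List<ProcessedRow>(batchSize);
            foreach (var item in processedQueue.GetConsumingEnumerable())
            {
                batch.Add(item);
                if (batch.Count == batchSize)
                {
                    // A real writer would bulk-insert the batch here (e.g. via SqlBulkCopy).
                    batches.Add(batch);
                    batch = new List<ProcessedRow>(batchSize);
                }
            }
            if (batch.Count > 0) batches.Add(batch); // flush the final partial batch
        });

        try
        {
            Task.WaitAll(consumers);
        }
        finally
        {
            processedQueue.CompleteAdding(); // no more processed rows will arrive
        }
        Task.WaitAll(producer, writer);
        return batches;
    }
}
```

Calling PipelineSketch.Run(rows, consumerCount: 4, batchSize: 500) would process rows on four consumer tasks and group the output into batches of up to 500. One caveat of this layout: parallel consumers do not preserve arrival order, so if the comparison with earlier readings depends on order, rows for the same entity would probably need to be routed to the same consumer. Error handling is kept minimal and would need hardening before real use.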