How do I architect a .NET application to run the same task many times simultaneously and independently?
I need to develop a .NET app that is very similar to a web spider/crawler: get data from a website, process the data, save it in a database, and send an email.
I want to process as many sites at once as the machine can handle (within reason). Each process is independent of the others. I will be using some third-party server components, such as those from Chilkat Software. Only a single computer is used, starting with Windows 7 64-bit and then moving to Windows Server.
What architecture or design should I use to handle the requirements I mentioned? Running several instances of the app (the easiest way)? Using Windows Workflow Foundation (I have never used it)? Some kind of parallel processing? Something else?
A pointer to a sample app which follows the proposed design is a plus.
Answers (3)
You can use a pipeline architecture: crawl -> process -> save to DB -> email. Connect the phases with thread-safe queues, and give each phase its own configurable number of threads (N). Then, in the production environment, measure and tune the thread count of each phase so that, most of the time, no phase is waiting for another phase to provide or consume data.
Be aware that there are many other factors to adjust for the best result. For example, suppose your database can handle at most one save per second, but the stages before it can easily produce ten pages per second. In that case, you may want to limit the size of the queue between the process and database phases to a fairly small number.
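A minimal sketch of that pipeline, assuming `BlockingCollection<T>` (available since .NET 4) as the thread-safe queue. The names `CrawlPipeline`, `StartStage`, and `Run`, the queue capacities, and the string-concatenating stage bodies are all hypothetical placeholders standing in for real HTTP, parsing, database, and SMTP code, and error handling is left out:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical sketch: crawl -> process -> save -> email, joined by queues.
public static class CrawlPipeline
{
    // Starts `workers` threads that take items from `input`, apply `work`,
    // and put the result on `output`. The returned task completes once all
    // workers are done and `output` has been marked complete.
    static Task StartStage<TIn, TOut>(
        BlockingCollection<TIn> input,
        BlockingCollection<TOut> output,
        int workers,
        Func<TIn, TOut> work)
    {
        var tasks = new Task[workers];
        for (int i = 0; i < workers; i++)
        {
            tasks[i] = Task.Factory.StartNew(() =>
            {
                // Blocks while the input is empty; the loop ends when the
                // input has been marked complete and fully drained.
                foreach (var item in input.GetConsumingEnumerable())
                    output.Add(work(item)); // blocks while a bounded output is full
            }, TaskCreationOptions.LongRunning);
        }
        // Tell the next stage that no more items will arrive.
        return Task.Factory.ContinueWhenAll(tasks, done => output.CompleteAdding());
    }

    public static List<string> Run(
        IEnumerable<string> urls, int crawlers, int processors, int savers, int mailers)
    {
        var toCrawl = new BlockingCollection<string>();
        var pages = new BlockingCollection<string>(100);  // assumed capacity
        var records = new BlockingCollection<string>(10); // kept small: the database is the slow stage
        var saved = new BlockingCollection<string>(100);  // assumed capacity
        var sent = new BlockingCollection<string>();      // unbounded result list

        var stages = new[]
        {
            StartStage(toCrawl, pages, crawlers, url => "page:" + url),         // placeholder fetch
            StartStage(pages, records, processors, page => "record:" + page),   // placeholder parse
            StartStage(records, saved, savers, record => "saved:" + record),    // placeholder DB save
            StartStage(saved, sent, mailers, item => "emailed:" + item)         // placeholder email
        };

        foreach (var url in urls)
            toCrawl.Add(url);
        toCrawl.CompleteAdding(); // lets the stages shut down one after another

        Task.WaitAll(stages);
        return sent.ToList(); // order is not guaranteed when a stage has several workers
    }
}
```

Calling `CrawlPipeline.Run(urls, 8, 2, 1, 1)` would run eight crawl workers, two process workers, and one worker each for the save and email stages. Because `records` is bounded, the process stage blocks in `Add` when the database stage falls behind, which is the queue-size limit described above. The worker counts and capacities are the values to measure and tune.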
Tuning all these factors and watching how they interact is interesting and fun. You will be surprised at how well the machine can perform compared with a naive just-go-multi-threaded (or multi-process) approach.
I'd recommend using the System.Threading.Tasks library for something like this.
You could then do something like this in your app:
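The snippet that accompanied this answer is not reproduced here. As a stand-in, the following is a minimal sketch of one way to use `System.Threading.Tasks` for this, not necessarily what the answer's author wrote. `SiteRunner`, `RunAll`, and `HandleSite` are hypothetical names, and `HandleSite` is a placeholder for the real fetch/process/save/email work:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class SiteRunner
{
    // Hypothetical placeholder for the per-site work:
    // fetch the site, process the data, save it, send the email.
    static string HandleSite(string url)
    {
        return "done:" + url;
    }

    public static string[] RunAll(IEnumerable<string> urls)
    {
        // One task per site; the default task scheduler (thread pool)
        // decides how many of them actually run at the same time.
        Task<string>[] tasks = urls
            .Select(url => Task.Factory.StartNew(() => HandleSite(url)))
            .ToArray();

        // Blocks until every site is finished. If any task failed, this
        // throws an AggregateException wrapping the individual errors.
        Task.WaitAll(tasks);

        // Results come back in the same order as the input URLs.
        return tasks.Select(t => t.Result).ToArray();
    }
}
```

Since every site is independent, no coordination between tasks is needed beyond waiting for them all at the end. If the number of simultaneous sites needs an explicit cap, `Parallel.ForEach` with `ParallelOptions.MaxDegreeOfParallelism` is an alternative in the same namespace.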
Workflow can definitely be used for this sort of thing as well. Its tracking gives it some significant advantages: it provides a detailed log of everything that occurred, and it makes handling multiple async tasks easy.
Given that you have never used it, the downside for you will be the ramp-up time. We do provide hands-on labs to get you going quickly.
See the hands-on labs on our Beginner's Guide to Workflow page.