Comparing multiprocessing and Twisted
Got a situation where I'm going to be parsing websites. Each site has to have its own "parser" and possibly its own way of dealing with cookies, etc.
I'm trying to get in my head which would be a better choice.
Choice I:
I can create a multiprocessing function, where the (masterspawn) app gets an input URL and in turn spawns a process/function within the masterspawn app that handles all the setup/fetching/parsing of the page/URL.
This approach would have one master app running, which in turn creates multiple instances of the internal function. Should be fast, yes/no?
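A minimal sketch of what Choice I could look like, assuming one parser function per hostname. The hostnames, the `parse_site_a`/`parse_site_b` functions, the `PARSERS` table and the placeholder `fetch` are all hypothetical names made up for illustration; only `multiprocessing.Pool` and `urllib.parse.urlparse` are real library calls.

```python
from multiprocessing import Pool
from urllib.parse import urlparse


def parse_site_a(html):
    # Hypothetical parser for one site.
    return {"site": "a", "length": len(html)}


def parse_site_b(html):
    # Hypothetical parser for another site.
    return {"site": "b", "length": len(html)}


# Hypothetical dispatch table: one parser per target hostname.
PARSERS = {
    "a.example": parse_site_a,
    "b.example": parse_site_b,
}


def fetch(url):
    # Placeholder so the sketch runs without network access. A real version
    # would issue the HTTP request and do the per-site cookie handling here.
    return "<html>%s</html>" % url


def handle(url):
    # Runs inside a worker process: pick the parser by hostname, fetch, parse.
    parser = PARSERS[urlparse(url).hostname]
    return url, parser(fetch(url))


def run(urls, processes=4):
    # The master process hands URLs to a pool of worker processes.
    with Pool(processes=processes) as pool:
        return pool.map(handle, urls)
```

Calling `run(["http://a.example/1", "http://b.example/2"])` from under an `if __name__ == "__main__":` guard would return one `(url, result)` pair per URL. Note that this only spreads work over the cores of a single box.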
Choice II:
I could create a "Twisted" kind of server that would essentially do the same thing as Choice I. The difference is that using Twisted would also impose some overhead. I'm trying to evaluate Twisted with regard to it being a "server", but I don't need it to perform the fetching of the URL.
Choice III:
I could use Scrapy. I'm inclined not to go this route, as I don't want/need the overhead that Scrapy appears to have. As I stated, each of the targeted URLs needs its own parse function, as well as its own cookie handling.
My goal is basically to have the "architected" solution spread across multiple boxes, where each client box interfaces with a master server that allocates the URLs to be parsed.
Thanks for any comments on this.
-tom
Comments (2)
There are two dimensions to this question: concurrency and distribution.
Concurrency: either Twisted or multiprocessing will do the job of concurrently handling fetching/parsing jobs. I'm not sure, though, where your premise of the "Twisted overhead" comes from. On the contrary, the multiprocessing path would incur much more overhead, since a (relatively heavy-weight) OS process would have to be spawned for each worker. Twisted's way of handling concurrency is much more light-weight.
Distribution: multiprocessing won't distribute your fetch/parse jobs to different boxes. Twisted can do this, e.g. using the AMP protocol building facilities.
I cannot comment on scrapy, never having used it.
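To illustrate the concurrency point, here is a rough sketch of event-loop style concurrency in a single process. It uses the standard library's `asyncio` as a stand-in, not Twisted itself, so it shows the general idea rather than Twisted's API. The `fetch_and_parse` coroutine and its fake delay are assumptions for illustration.

```python
import asyncio


async def fetch_and_parse(url, delay=0.01):
    # asyncio.sleep stands in for waiting on the network (assumption);
    # a real job would await an HTTP response and then parse it.
    await asyncio.sleep(delay)
    return url, len(url)


async def crawl(urls):
    # All jobs share one process and one thread. The event loop switches
    # between them while they wait, so no OS process is spawned per job.
    return await asyncio.gather(*(fetch_and_parse(u) for u in urls))


def run(urls):
    # Start an event loop, run every job to completion, return the results
    # in the same order as the input URLs.
    return asyncio.run(crawl(urls))
```

This style pays off when the jobs spend most of their time waiting on I/O. CPU-heavy parsing would still block the loop and may be better off in separate processes.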
For this particular question I'd go with multiprocessing: it's simple to use and simple to understand. You don't particularly need Twisted, so why take on the extra complication?
One other option you might want to consider: use a message queue. Have the master drop URLs onto a queue (e.g. beanstalkd, resque, 0mq) and have worker processes pick up the URLs and process them. You'll get both concurrency and distribution: you can run workers on as many machines as you want.
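The master/worker pattern could be sketched like this. To keep the example self-contained, the standard library's `queue.Queue` and threads stand in for a real broker and for worker processes on other machines; that substitution is an assumption, and `worker`, `run_master` and the `STOP` sentinel are hypothetical names.

```python
import queue
import threading

STOP = None  # sentinel telling a worker to exit (hypothetical convention)


def worker(jobs, results, process):
    # Each worker pulls URLs off the shared queue until it sees the sentinel.
    while True:
        url = jobs.get()
        if url is STOP:
            break
        results.put((url, process(url)))


def run_master(urls, process, n_workers=3):
    # The in-process queues stand in for a broker such as beanstalkd.
    jobs, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, process))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for url in urls:
        jobs.put(url)       # master drops URLs onto the queue
    for _ in threads:
        jobs.put(STOP)      # one sentinel per worker
    for t in threads:
        t.join()
    # Every worker has exited, so all results are already on the queue.
    return dict(results.get() for _ in urls)
```

With a real broker the `worker` loop would stay the same shape, but each worker would run as its own process on any machine that can reach the queue, which is where the distribution comes from.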