Screen Scraping Efficiency
We are going to be scraping thousands of websites each night to update client data, and we are in the process of deciding which language we would like to use to do the scraping.
We are not locked into any platform or language, and I am simply looking for efficiency. If I have to learn a new language to make my servers perform well, that is fine.
Which language/platform will provide the highest scraping efficiency per dollar for us? Really, I'm looking for real-world experience with high-volume scraping. It will be about maximizing CPU/memory/bandwidth.
6 Answers
You will be IO bound anyway, the performance of your code won't matter at all (unless you're a really bad programmer..)
Using a combination of Python and Beautiful Soup, it's incredibly easy to write screen-scraping code very quickly. There is a learning curve for Beautiful Soup, but it's worth it.
Efficiency-wise, I'd say it's just as quick as any other method out there. I've never done thousands of sites at once, but I'd wager that it's definitely up to the task.
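A minimal sketch of that approach, assuming requests handles the HTTP side; the URL and the CSS selector are placeholders, not anything from the original answer:

import requests
from bs4 import BeautifulSoup

def scrape_titles(url):
    # Download one page and pull the text out of every element matching a selector.
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return [el.get_text(strip=True) for el in soup.select("h2.title")]  # placeholder selector

print(scrape_titles("http://example.com"))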
For web scraping I use Python with lxml and a few other libraries: http://webscraping.com/blog
I/O is the main bottleneck when crawling - to download data at a good rate you need to use multiple threads.
I cache all downloaded HTML, so memory use is low.
Often after crawling I need to rescrape different features, and CPU becomes important.
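Roughly, that workflow might look like the sketch below: a thread pool for the I/O-bound download phase, a simple on-disk cache of the raw HTML, and a separate CPU-bound pass that re-parses the cached copies with lxml. The use of requests, the cache directory, and the URL list are assumptions for illustration only.

import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

import requests
from lxml import html

CACHE_DIR = "html_cache"  # hypothetical cache location
os.makedirs(CACHE_DIR, exist_ok=True)

def cache_path(url):
    # One file per URL, keyed by a hash of the URL.
    return os.path.join(CACHE_DIR, hashlib.sha1(url.encode()).hexdigest() + ".html")

def download(url):
    # Fetch a URL unless it is already cached; return the path to the cached HTML.
    path = cache_path(url)
    if not os.path.exists(path):
        resp = requests.get(url, timeout=30)
        with open(path, "wb") as f:
            f.write(resp.content)
    return path

def extract_title(path):
    # Re-scrape a feature (here just the <title>) from cached HTML - the CPU-bound part.
    tree = html.parse(path)
    return tree.findtext(".//title")

urls = ["http://example.com", "http://example.org"]  # placeholder URL list

# I/O-bound phase: many threads downloading concurrently.
with ThreadPoolExecutor(max_workers=20) as pool:
    paths = list(pool.map(download, urls))

# CPU-bound phase: parse the cached pages.
for p in paths:
    print(extract_title(p))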
If you know C, a single-stream synchronous link (called the "easy" method) is a short day's work with libcURL. Multiple asynchronous streams (called the "multi" method) is a couple hours more.
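This isn't the C code the answer refers to, but libcurl's "multi" interface is also exposed through pycurl, so a rough Python sketch of driving several asynchronous streams at once (with placeholder URLs) might look like this:

from io import BytesIO

import pycurl

urls = ["http://example.com", "http://example.org"]  # placeholder URLs

multi = pycurl.CurlMulti()
handles = []
for url in urls:
    buf = BytesIO()
    easy = pycurl.Curl()
    easy.setopt(pycurl.URL, url)
    easy.setopt(pycurl.WRITEFUNCTION, buf.write)
    easy.buf = buf  # keep a reference so the body can be read afterwards
    multi.add_handle(easy)
    handles.append(easy)

# Pump the multi handle until every transfer has finished.
num_active = len(handles)
while num_active:
    ret, num_active = multi.perform()
    if ret != pycurl.E_CALL_MULTI_PERFORM:
        multi.select(1.0)

for easy in handles:
    print(easy.getinfo(pycurl.EFFECTIVE_URL), len(easy.buf.getvalue()), "bytes")
    multi.remove_handle(easy)
    easy.close()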
With the volume that thousands of sites would require, you may be better off economically by looking at commercial packages. They eliminate the IO problem, and have tools specifically designed to handle the nuances between every site, as well as post-scraping tools to normalize the data, and scheduling to keep the data current.
I would recommend Web Scraping Language
Compare a simple WSL query:
GOTO example.com >> EXTRACT {'column1':td[0], 'column2': td[1]} IN table.spad
with the following example:
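(The comparison snippet the answer refers to is missing from this copy. As a stand-in, a plain-Python version of the same extraction, written with requests and lxml under the assumption that the page serves a table with class "spad", might look like this:)

import requests
from lxml import html

resp = requests.get("http://example.com")
tree = html.fromstring(resp.content)

rows = []
# Hypothetical equivalent of the WSL query: walk table.spad and take the first two cells of each row.
for tr in tree.xpath("//table[contains(@class, 'spad')]//tr"):
    tds = tr.xpath("./td")
    if len(tds) >= 2:
        rows.append({"column1": tds[0].text_content(), "column2": tds[1].text_content()})
print(rows)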