Parallel downloading with urlretrieve
I regularly have to download and rename HTML pages in bulk and wrote this simple code for it a while ago:
import shutil
import os
import sys
import socket
import urllib.request

socket.setdefaulttimeout(5)

file_read = open(my_file, "r")
lines = file_read.readlines()
for line in lines:
    try:
        sl = line.strip().split(";")
        url = sl[0]
        newname = str(sl[1]) + ".html"
        urllib.request.urlretrieve(url, newname)
    except:
        pass
file_read.close()
This works well enough for a few hundred websites, but takes waaaaay too long for a larger number of downloads (20-50k). What would be the simplest and best way to speed it up?
A:
The SIMPLEST ( which the approach suggested in the comments is not ) & the BEST way is to at least:
(a) minimise all overheads ( 50k times the Thread-instantiation cost being one such class of costs ),
(b) harness the embarrassing independence of the work ( which is still not a true-[PARALLEL] process-flow ),
(c) go as close as possible to the bleeding edge of a just-[CONCURRENT], latency-masked process-flow.

Given that both simplicity and performance seem to be the measure of "best"-ness: any cost that does not first justify introducing itself by a sufficiently increased performance, and that, second, does not create an additional positive net effect on performance ( speed-up ), is a performance ANTI-pattern & an unforgivable Computer Science sin.
Therefore I could not promote using GIL-lock-bound Python threads: by design the GIL prevents even just-[CONCURRENT] processing, and any number of Python threads is performance-suffocated into a step-by-step, round-robin, one-after-another-after-another-...-re-[SERIAL]-ised chain of roughly 100 [ms] quanta of code-interpretation time, during which one and only one such Python thread is let to run while all the others sit blocked-waiting ( rather a performance ANTI-pattern, isn't it? ). So rather go in for process-based concurrency of the work-flow. Performance gains a lot here for ~ 50k url-fetches, where each fetch carries latencies of many hundreds or thousands of [ms] ( protocol-and-security handshaking setup + remote url-decode + remote content-assembly + remote content-into-protocol encapsulation + remote-to-local network-flows + local protocol-decode + ... ) that independent processes can mask by overlapping them.
Sketched process-flow framework:
The initial, Python-interpreter __main__-side trick is to spawn just enough worker-processes, which start crawling the my_file-"list" of URL-s independently, so that an indeed just-[CONCURRENT] flow of work starts, each worker independent of any other. The block_processor_FUN(), passed by reference to the workers, simply opens the file and fetches/processes only its "own" fraction of it, from ( wPROC / MAX_WORKERs ) to ( ( wPROC + 1 ) / MAX_WORKERs ) of its number of lines. That simple.
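A minimal sketch of that scheme, assuming block_processor_FUN() receives the worker index, the total worker count and the file name as its arguments ( that exact signature and the MAX_WORKERs value are illustrative assumptions, not taken from the answer ):

import multiprocessing as mp
import socket
import urllib.request

socket.setdefaulttimeout(5)

MAX_WORKERs = 8                       # illustrative; tune to CPU cores / network capacity
my_file     = "urls.txt"              # illustrative name for the ";"-separated url;name list

def block_processor_FUN(wPROC, MAX_WORKERs, my_file):
    # each worker opens the file itself and serves only its "own" fraction of the lines,
    # from ( wPROC / MAX_WORKERs ) to ( ( wPROC + 1 ) / MAX_WORKERs ) of their count
    with open(my_file, "r") as f:
        lines = f.readlines()
    lo = len(lines) *  wPROC      // MAX_WORKERs
    hi = len(lines) * (wPROC + 1) // MAX_WORKERs
    for line in lines[lo:hi]:
        try:
            sl = line.strip().split(";")
            urllib.request.urlretrieve(sl[0], sl[1] + ".html")
        except Exception:
            pass                      # skip malformed lines and failed downloads

if __name__ == "__main__":
    workers = [mp.Process(target=block_processor_FUN,
                          args=(wPROC, MAX_WORKERs, my_file))
               for wPROC in range(MAX_WORKERs)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

Each worker re-reads my_file on its own, so nothing has to be shared or passed between processes; the only coordination is the final join().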
If willing to tune up corner cases, where some URLs may (and do) take longer than others, one may add a form of load-balancing fair-queueing, yet at the cost of a more complex design ( many process-to-process messaging queues are available ): a { __main__ | main() }-side FQ/LB-feeder, with the worker-processes retrieving their next task from that job-request FQ/LB-facility. More complex, but more robust to an uneven distribution of URL-serving durations across the my_file-ordered list of URL-s to serve. The choice of the simplicity / complexity compromise, which impacts the resulting performance / robustness, is yours.
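One possible shape of that load-balanced variant, sketched under the assumption that a multiprocessing.JoinableQueue serves as the job-request FQ/LB-facility ( the worker-function name url_worker and the poison-pill shutdown are illustrative choices, not taken from the answer ):

import multiprocessing as mp
import socket
import urllib.request

socket.setdefaulttimeout(5)

MAX_WORKERs = 8                       # illustrative

def url_worker(job_q):
    # a worker asks the FQ/LB-facility for its next ( url, name ) task as soon as it is free,
    # so a few slow URLs no longer stall a whole pre-assigned block of lines
    while True:
        task = job_q.get()
        if task is None:              # poison pill: no more work for this worker
            job_q.task_done()
            break
        url, name = task
        try:
            urllib.request.urlretrieve(url, name + ".html")
        except Exception:
            pass
        job_q.task_done()

if __name__ == "__main__":
    job_q = mp.JoinableQueue()
    workers = [mp.Process(target=url_worker, args=(job_q,))
               for _ in range(MAX_WORKERs)]
    for w in workers:
        w.start()

    with open("urls.txt", "r") as f:  # the { __main__ | main() }-side FQ/LB-feeder
        for line in f:
            sl = line.strip().split(";")
            if len(sl) >= 2:
                job_q.put((sl[0], sl[1]))

    for _ in workers:
        job_q.put(None)               # one poison pill per worker
    job_q.join()                      # wait until every queued task was task_done()-ed
    for w in workers:
        w.join()

Compared with the fixed-block split, the feeder pays for an extra process-to-process queue, which is exactly the simplicity / complexity trade-off described above.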
For more details you may like to read this, and the code from this, and the examples or tips directed there for further performance-boosting.