Parallel downloads with urlretrieve

Published 2025-01-11 05:54:11

I regularly have to download and rename HTML pages in bulk and wrote this simple code for it a while ago:

import socket
import urllib.request

socket.setdefaulttimeout(5)              # give up on a stalled host after 5 seconds

# my_file: path to a text file with one "url;name" record per line
with open(my_file, "r") as file_read:
    for line in file_read:
        try:
            sl = line.strip().split(";")
            url = sl[0]
            newname = sl[1] + ".html"
            urllib.request.urlretrieve(url, newname)
        except Exception:
            pass                         # skip malformed lines and failed downloads

This works well enough for a few hundred websites, but takes waaaaay too long for a larger number of downloads (20-50k). What would be the simplest and best way to speed it up?

Comments (1)

夜吻♂芭芘 2025-01-18 05:54:11

Q :
" I regularly have to ...
What would be the simplest and best way to speed it up ? "

A :
The SIMPLEST ( which the commented approach is not ) &
the BEST way
is to at least :
(a) minimise all overheads ( 50k Thread-instantiation costs being one such class of costs ),
(b) harness the embarrassing independence of the fetches ( yet not a True-[PARALLEL] process-flow ),
(c) go as close as possible to the bleeding edge of a just-[CONCURRENT], latency-masked process-flow ( see the back-of-envelope sketch below ).
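
A back-of-envelope sketch of why such latency-masking matters at this scale ( the 0.5 [s] mean per-fetch latency and the 16 worker-processes are assumptions picked only for illustration ):

n_urls       = 50_000
avg_fetch_s  = 0.5                           # assumed mean end-to-end latency per URL
n_workers    = 16                            # assumed number of worker-processes

serial_hours = n_urls * avg_fetch_s / 3600   # ~ 6.9 [hrs] fetched one-after-another
masked_hours = serial_hours / n_workers      # ~ 0.43 [hrs] if latencies overlap across workers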

Given
that both simplicity & performance seem to be the measure of "best"-ness :

Any cost that does not, first, justify the cost of its own introduction by a sufficiently increased performance and, second, create an additional positive net-effect on performance ( speed-up ), is a performance ANTI-pattern & an unforgivable Computer Science sin.

Therefore
I could not promote using any amount of GIL-lock-bound & performance-suffocated Python-threads ( the GIL, by design, prevents even a just-[CONCURRENT] processing ): the interpreter re-[SERIAL]-ises such threads into a one-after-another-after-another round-robin chain of short time-quanta of code-interpretation, in which one and only one such Python-thread is let to run while all the others stay blocked-waiting ( rather a performance ANTI-pattern, isn't it? ),
so
rather go in for a process-based concurrency of the work-flow. Performance gains a lot here for ~ 50k url-fetches, where each fetch spends hundreds to thousands of [ms] in latencies ( protocol-and-security handshaking setup + remote url-decode + remote content-assembly + remote content-into-protocol encapsulation + remote-to-local network-flows + local protocol-decode + ... ) that independent worker-processes can mask by overlapping them.
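
( As a side-note on those GIL time-quanta: the interpreter's thread-switch interval can be inspected and even tuned, yet tuning it does not remove the one-thread-at-a-time re-[SERIAL]-isation itself: )

import sys

print( sys.getswitchinterval() )         # current GIL thread-switch interval, in [s]
sys.setswitchinterval( 0.001 )           # may be shortened, the GIL round-robin still remains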

Sketched process-flow framework :

import os
from joblib import Parallel, delayed

n_CPU_cores = os.cpu_count() or 2         #-------------------- detected CPU-core count
MAX_WORKERs = max( 1, n_CPU_cores - 1 )   #-------------------- leave one core for the O/S

def main( files_in ):
    """                                                     __doc__
    .INIT worker-processes, each with a split-scope of tasks
    """
    IDs = range( max( 1, MAX_WORKERs ) )
    RES_if_need = Parallel( n_jobs = MAX_WORKERs
                            )(       delayed( block_processor_FUN #-- fun CALLABLE
                                              )( files_in, #--------- fun PAR1
                                                 wPROC     #--------- fun PAR2
                                                 )
                                              for wPROC in IDs
                                     )

def block_processor_FUN( file_with_URLs = None,
                         file_from_PART = 0
                         ):
    """                                                     __doc__
    .OPEN file_with_URLs
    .READ file_from_PART, row-wise - till next part starts
                                   - ref. global MAX_WORKERs
    """
    ...

This is the initial Python-interpreter __main__-side trick to spawn just-enough worker-processes, which start crawling the my_file-"list" of URL-s, and an indeed just-[CONCURRENT] flow of work starts, each worker independent of any other.
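
A hedged usage sketch of that __main__-side entry point ( urls.txt being an assumed file name with one "url;name" record per line ): depending on the process-backend, spawned workers may re-import the calling module, so keeping the call under a __main__ guard is the safe convention:

if __name__ == "__main__":
    my_file = "urls.txt"                 # assumed input file, one "url;name" record per line
    main( my_file )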

The block_processor_FUN(), passed by reference to the workers, does simply open the file and starts fetching / processing only its "own" fraction of it, the rows from ( wPROC / MAX_WORKERs ) up to ( ( wPROC + 1 ) / MAX_WORKERs ) of its number of lines.
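
Filled in along those lines, a minimal sketch of such a body might look as follows ( it re-uses the global MAX_WORKERs from the sketch above, and carries the "url;name"-per-line format, the .html suffix and the plain urlretrieve() call over from the question's code ):

import urllib.request

def block_processor_FUN( file_with_URLs = None,
                         file_from_PART = 0
                         ):
    with open( file_with_URLs ) as fh:                           # .OPEN file_with_URLs
        rows = fh.readlines()
    n_rows = len( rows )
    beg    = (   file_from_PART       * n_rows ) // MAX_WORKERs  # own block starts here
    end    = ( ( file_from_PART + 1 ) * n_rows ) // MAX_WORKERs  # own block ends here
    for row in rows[beg:end]:                                    # .READ row-wise, own part only
        try:
            url, name = row.strip().split( ";" )[:2]
            urllib.request.urlretrieve( url, name + ".html" )
        except Exception:
            pass                                                 # skip malformed rows / failed fetches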

That simple.

If willing to tune up the corner-cases, where some URLs may and do take longer than others, one may move on to a form of load-balancing fair-queueing, yet at the cost of a more complex design ( many process-to-process messaging queues are available for this ): a { __main__ | main() }-side FQ/LB-feeder dispatches the jobs and the worker-processes retrieve their next task from such a job-request FQ/LB-facility.

More complex, yet more robust to an uneven distribution of URL-serving durations "across" the my_file-ordered list of URL-s to serve.
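
A minimal sketch of such a feeder / worker arrangement, assuming only the standard-library multiprocessing module ( lb_main, lb_worker and urls.txt are hypothetical names introduced here just for illustration, not part of the framework above ):

import multiprocessing as mp
import urllib.request

def lb_worker( task_q ):
    while True:
        task = task_q.get()                                 # pull the next job from the shared FQ/LB queue
        if task is None:                                    # poison-pill: no more work
            break
        url, name = task
        try:
            urllib.request.urlretrieve( url, name + ".html" )
        except Exception:
            pass                                            # skip failed fetches, as in the original code

def lb_main( file_with_URLs, n_workers = 8 ):
    task_q  = mp.Queue( maxsize = 2 * n_workers )           # small buffer keeps the feeder just ahead
    workers = [ mp.Process( target = lb_worker, args = ( task_q, ) )
                for _ in range( n_workers ) ]
    for w in workers: w.start()
    with open( file_with_URLs ) as fh:                      # { __main__ | main() }-side feeder
        for line in fh:
            parts = line.strip().split( ";" )
            if len( parts ) < 2:
                continue                                    # skip malformed rows
            task_q.put( ( parts[0], parts[1] ) )            # blocks when the queue is full
    for _ in workers: task_q.put( None )                    # one poison-pill per worker
    for w in workers: w.join()

if __name__ == "__main__":
    lb_main( "urls.txt", n_workers = 8 )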

The choice of the level of the simplicity / complexity compromise, which impacts the resulting performance / robustness, is yours.

For more details you may like to read further on process-based concurrency in Python and on examples or tips for further performance-boosting.
