How does urllib.urlopen() work?

Published 2024-12-01 21:28:25

Let's consider a big file (~100 MB). Let's assume the file is line-based (a text file with relatively short lines, ~80 characters). If I use the built-in open()/file(), the file will be loaded lazily, i.e. if I do aFile.readline(), only a chunk of the file will reside in memory. Does urllib.urlopen() do something similar (using a cache on disk)?

How big is the performance difference between urllib.urlopen().readline() and file().readline()? Let's assume the file is located on localhost. Suppose I open it once with urllib.urlopen() and once with file(). How big will the difference in performance/memory consumption be when I loop over the file with readline()?

What is the best way to process a file opened via urllib.urlopen()? Is it faster to process it line by line? Or should I load a bunch of lines (~50) into a list and then process the list?



Comments (2)

半枫 2024-12-08 21:28:25

open (or file) and urllib.urlopen look like they're more or less doing the same thing here. urllib.urlopen (basically) creates a socket._socketobject and then invokes its makefile method (the contents of that method are included below):

def makefile(self, mode='r', bufsize=-1):
    """makefile([mode[, bufsize]]) -> file object

    Return a regular file object corresponding to the socket.  The mode
    and bufsize arguments are as for the built-in open() function."""
    return _fileobject(self._sock, mode, bufsize)
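To see that behaviour directly, here is a minimal sketch (using Python 3, where makefile() still exists on socket objects and returns a buffered file-like wrapper, much like the _fileobject above; socket.socketpair() stands in for a real network connection so the example needs no server):

```python
import socket

# A connected pair of sockets stands in for a real network connection.
a, b = socket.socketpair()
b.sendall(b"hello\nworld\n")

# makefile() wraps the socket in a buffered file-like object,
# so the usual lazy readline() protocol works on it.
f = a.makefile("rb")
line1 = f.readline()
line2 = f.readline()
print(line1, line2)

f.close()
a.close()
b.close()
```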

阪姬 2024-12-08 21:28:25

Does urllib.urlopen() do something similar (using a cache on disk)?

The operating system does. When you use a networking API such as urllib, the operating system and the network card do the low-level work of splitting data into small packets that are sent over the network, and of receiving incoming packets. Those are stored in a buffer, so that the application can abstract away the packet concept and pretend it is sending and receiving continuous streams of data.
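The file-like object you get back therefore reads from that buffer lazily. A small sketch (with Python 3's urllib.request, where urllib.urlopen moved, and a file:// URL to a temporary file standing in for a real server):

```python
import os
import tempfile
import urllib.request

# Create a small line-based sample file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    for i in range(1000):
        f.write("line %04d\n" % i)
    path = f.name

# urlopen() returns a file-like object; readline() pulls data
# incrementally rather than loading the whole resource into memory.
resp = urllib.request.urlopen("file://" + path)
first = resp.readline()
second = resp.readline()
resp.close()
os.remove(path)

print(first, second)
```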

How big is the performance difference between urllib.urlopen().readline() and file().readline()?

It is hard to compare the two. For urllib, it depends on the speed of the network as well as the speed of the server. Even for local servers there is some abstraction overhead, so reading from the networking API is usually slower than reading from a file directly.

For an actual performance comparison, you will have to write a test script and measure. However, why bother? You cannot replace one with the other, since they serve different purposes.
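Such a test script could look roughly like this (a sketch under the assumption that a file:// URL to a local file approximates the localhost case; a real comparison would put an HTTP server on localhost, and the file here is scaled down from the 100 MB example):

```python
import os
import tempfile
import time
import urllib.request

# A scaled-down stand-in for the 100 MB file: 100,000 short lines.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    for _ in range(100_000):
        f.write("x" * 78 + "\n")
    path = f.name

def count_lines(fobj):
    """Loop over the object with readline() and count the lines."""
    n = 0
    while fobj.readline():
        n += 1
    return n

start = time.perf_counter()
with open(path, "rb") as local:
    n_file = count_lines(local)
t_file = time.perf_counter() - start

start = time.perf_counter()
with urllib.request.urlopen("file://" + path) as remote:
    n_url = count_lines(remote)
t_url = time.perf_counter() - start

os.remove(path)
print("open():    %d lines in %.3fs" % (n_file, t_file))
print("urlopen(): %d lines in %.3fs" % (n_url, t_url))
```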

What is the best way to process a file opened via urllib.urlopen()? Is it faster to process it line by line? Or should I load a bunch of lines (~50) into a list and then process the list?

Since the bottleneck is the network speed, it might be a good idea to process the data as soon as you get it. That way, the operating system can buffer more incoming data "in the background".

It makes no sense to cache lines in a list before processing them. Your program would just sit there waiting for enough data to arrive, while it could already be doing something useful.
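A sketch of that streaming style (again with Python 3's urllib.request and a file:// URL; summing integers is a hypothetical stand-in for real per-line work):

```python
import os
import tempfile
import urllib.request

# Sample data standing in for the remote resource.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("10\n20\n30\n")
    path = f.name

total = 0
with urllib.request.urlopen("file://" + path) as resp:
    # readline() returns b"" at EOF; each line is processed as soon
    # as it arrives instead of being collected into a list first.
    for raw in iter(resp.readline, b""):
        total += int(raw)

os.remove(path)
print(total)
```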

