How to read a large file in UNIX/Linux with limited RAM?
I have a large text file to open (e.g. 5 GB). With limited RAM (say 1 GB), how can I open and read the file without any memory errors? I am running on a Linux terminal with the basic packages installed.
This was an interview question, so please do not worry about practicality.
I do not know whether to look at it at the system level or the programmatic level... It would be great if someone could shed some light on this.
Thanks.
Comments (3)
Read it character by character... or X bytes by X bytes... it really depends on what you want to do with it... As long as you don't need the whole file at once, that works.
(Ellipses are awesome)
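Reading "X bytes by X bytes" could look roughly like the Python sketch below. The helper names `iter_chunks` and `count_lines` and the 64 KiB default chunk size are illustrative assumptions, not anything from the question; the point is only that memory use stays bounded by the chunk size regardless of the file size.

```python
import os
import tempfile


def iter_chunks(path, chunk_size=64 * 1024):
    """Yield the file's contents as successive bytes objects of at most chunk_size."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            yield chunk


def count_lines(path, chunk_size=64 * 1024):
    """Count newline bytes while holding only one chunk in memory at a time."""
    return sum(chunk.count(b"\n") for chunk in iter_chunks(path, chunk_size))


if __name__ == "__main__":
    # Small stand-in file for the demo; the same loop applies to a 5 GB file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"line\n" * 1000)
    try:
        print(count_lines(tmp.name, chunk_size=4096))
    finally:
        os.remove(tmp.name)
```

Counting newlines is just a stand-in task; anything that can be computed one piece at a time (searching, summing, filtering) fits the same loop.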
What do they want you to do with the file? Are you looking for something? Extracting something? Sorting? This will affect your approach.
If you're looking for something, it may be sufficient to read the file line by line or character by character. If you need to jump around the file or analyze sections of it, then you most likely want to memory-map it. Look up mmap(). Here's a short article on the subject: memory mapped i/o
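As a rough illustration of the memory-mapping suggestion, here is a sketch using Python's standard `mmap` module, which wraps the mmap() call. The helper names `find_all` and `read_at` are hypothetical. It assumes a 64-bit system, so that a 5 GB file fits in the virtual address space, and a non-empty file, since mapping an empty file raises an error. The mapping itself does not load the file; the kernel pages data in as it is touched.

```python
import mmap
import os
import tempfile


def find_all(path, needle):
    """Return the byte offsets of every occurrence of needle in the file."""
    offsets = []
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded lazily on access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1:
                offsets.append(pos)
                pos = mm.find(needle, pos + 1)
    return offsets


def read_at(path, offset, length):
    """Jump straight to offset and return length bytes from there."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"alpha beta alpha gamma")
    try:
        print(find_all(tmp.name, b"alpha"))
        print(read_at(tmp.name, 6, 4))
    finally:
        os.remove(tmp.name)
```

The mapped object behaves like a byte string, so jumping to an arbitrary offset is a slice rather than a seek followed by a read.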
[just comment]
If you are going to use system calls (open() and read()), then reading character by character will generate a huge number of system calls and severely slow down your application. Even with the buffer cache (disk cache) in place, system calls are expensive.
It is better to read block by block, where the block size "SHOULD" be 1 MB or more. With a 1 MB block size, a 5 GB file takes 5*1024 = 5120 read() calls.
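To make the call count concrete, the sketch below uses Python's `os.open()` and `os.read()`, which are thin wrappers over the corresponding system calls. The function name `read_in_blocks` is an assumption for illustration. It returns the number of bytes read and the number of `read()` calls that returned data; the final call that reports end of file is one more on top of that.

```python
import os
import tempfile


def read_in_blocks(path, block_size=1024 * 1024):
    """Read a file with raw os.read() calls; return (total_bytes, data_reads)."""
    fd = os.open(path, os.O_RDONLY)
    total = 0
    calls = 0
    try:
        while True:
            block = os.read(fd, block_size)  # one read() system call per iteration
            if not block:  # an empty result signals end of file
                break
            calls += 1
            total += len(block)
    finally:
        os.close(fd)
    return total, calls


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 10_000)
    try:
        print(read_in_blocks(tmp.name, block_size=1000))  # few calls
        print(read_in_blocks(tmp.name, block_size=1))     # one call per byte
    finally:
        os.remove(tmp.name)
```

Scaled up to a 5 GiB file, 1 MiB blocks mean about 5120 data-returning calls, while a 1-byte block size means 5,368,709,120 of them. A read() may in principle return fewer bytes than requested, so the call count is a lower bound rather than an exact figure.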