Seek in a text file
I'm working with very large text files, 2 GB and more, and I would like to have a Seek()-like function for them. Has anyone done something like that? Loading the file into a TStringList is out of the question. I am also working with untyped files. For now I'm using ReadLn, but that takes too long. Thanks.
Comments (5)
Map the file into memory piece by piece (CreateFileMapping/MapViewOfFile), then scan the mapped memory and build an index: a list of the positions at which each line begins. A seek operation then consists of looking up the position of the Nth line in the index and seeking to that position. You can then use TFileStream for random access to the file or, if you only read the file, you can use the file mapping for random access as well. That might even be faster than using TFileStream alongside the file mapping.
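The indexing idea is language-neutral, so here is a minimal sketch of it in Python rather than Delphi. Python's mmap module uses CreateFileMapping/MapViewOfFile on Windows and mmap on POSIX. The names build_line_index, read_line, and the window parameter are made up for this illustration, and the 64 MB default window is an arbitrary assumption.

```python
import mmap
import os


def build_line_index(path, window=64 * 1024 * 1024):
    """Return a list with the byte offset at which each line starts.

    The file is mapped one window at a time, so the whole file never has
    to fit into the address space at once.
    """
    size = os.path.getsize(path)
    offsets = [0] if size else []
    # A mapping offset must be a multiple of the allocation granularity.
    gran = mmap.ALLOCATIONGRANULARITY
    window = max(gran, window - window % gran)
    with open(path, "rb") as f:
        base = 0
        while base < size:
            length = min(window, size - base)
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=base) as view:
                pos = view.find(b"\n")
                while pos != -1:
                    start = base + pos + 1  # first byte after the LF
                    if start < size:        # no empty "line" after a final LF
                        offsets.append(start)
                    pos = view.find(b"\n", pos + 1)
            base += length
    return offsets


def read_line(path, offsets, n):
    """Seek straight to line n (0-based) and return it without its line end."""
    with open(path, "rb") as f:
        f.seek(offsets[n])
        return f.readline().rstrip(b"\r\n")
```

Because LF is a single byte, a line end can never straddle two windows, so the scan needs no special handling at window borders. CRLF files work unchanged, since each line still starts right after an LF.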
Try GpHugeFile.
You set some pretty hard boundary conditions.
The only thing I can imagine is to get the handle from the text file and use Win32 functions to seek directly. Beware of text file caching, though.
If a large codebase that uses writeln/readln is the reason, implementing your own text file driver that allows seeking (or simplifies caching) might be the solution.
Free Pascal has a getfilehandle function for this purpose, which retrieves the OS handle from textfile/tfilerec files. I don't know what recent Delphi versions add in this department.
If you need line-level rather than byte-level granularity, there is no way to avoid reading through the entire file at least once to find the end-of-line markers (LF or CRLF, depending on your environment). This is a hard limit: you can't know in advance where the line ends are going to be.
After building the index that maps line numbers to byte offsets, you could cache it on disk and use a heuristic such as the last-modified time to check whether the index needs to be regenerated. You need a heuristic because the only way to be sure that the file contents haven't changed is to read through the file, and at that point you might as well rebuild the index, since you'll be I/O bound anyway.
As suggested by others, the underlying mechanism will have to be CreateFileMapping / MapViewOfFile (or mmap under POSIX).
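The caching heuristic can be sketched as follows, in Python rather than Delphi. The index file starts with the size and modification time of the text file it was built from, and is reused only while both still match. The names load_or_build_index and scan_line_starts, the ".idx" suffix, and the file layout are all assumptions made for this example. The scan uses plain chunked reads to keep the sketch short.

```python
import os
import struct
from array import array

# Header of the index file: size and mtime (in ns) of the indexed text file.
HEADER = struct.Struct("<qq")


def scan_line_starts(path, chunk_size=1 << 20):
    """Read the file once and return the byte offset of every line start."""
    offsets = array("q", [0])
    base = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            pos = chunk.find(b"\n")
            while pos != -1:
                offsets.append(base + pos + 1)
                pos = chunk.find(b"\n", pos + 1)
            base += len(chunk)
    if offsets[-1] == base:  # nothing follows the last LF, or file is empty
        offsets.pop()
    return offsets


def load_or_build_index(path, index_path=None):
    """Return (offsets, rebuilt); rebuilt is False when the cache was reused."""
    index_path = index_path or path + ".idx"
    st = os.stat(path)
    stamp = (st.st_size, st.st_mtime_ns)
    try:
        with open(index_path, "rb") as f:
            head = f.read(HEADER.size)
            if len(head) == HEADER.size and HEADER.unpack(head) == stamp:
                offsets = array("q")
                offsets.frombytes(f.read())
                return offsets, False
    except (OSError, ValueError):
        pass  # missing or damaged cache: fall through and rebuild
    offsets = scan_line_starts(path)
    with open(index_path, "wb") as f:
        f.write(HEADER.pack(*stamp))
        offsets.tofile(f)
    return offsets, True
```

Comparing the size as well as the timestamp is cheap and catches some edits that a coarse timestamp would miss. It remains a heuristic: a same-size rewrite that preserves the timestamp goes unnoticed.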
You can use this function to change the current position in a TText file:
It will return true on success and false on error (invalid position or file not opened).
If you want fast access, make sure that you have set {$I-} and check IOResult by hand, and that you have called System.SetTextBuf() with some buffer (1 KB up to 64 KB could make sense).
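The contract described above (move to a byte position, report success as a Boolean, read through a larger buffer) can be illustrated with a short sketch. It is written in Python rather than Delphi and is not the answer's own function. text_seek is a hypothetical name, and the buffering argument of open() stands in for the text buffer.

```python
import os


def text_seek(f, position):
    """Move f to byte offset `position`.

    Returns True on success and False on error (position out of range,
    or file not open), instead of raising.
    """
    try:
        if position < 0 or position > os.fstat(f.fileno()).st_size:
            return False
        f.seek(position)
        return True
    except (OSError, ValueError):  # ValueError: the file is already closed
        return False
```

A typical call opens the file with an explicit buffer size and reads the next line only if the seek succeeded, for example `f = open(path, "rb", buffering=64 * 1024)` followed by `if text_seek(f, pos): line = f.readline()`.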