I've got the lovely task of working out how to handle large files being loaded into our application's script editor (it's like VBA for our internal product, for quick macros). Most files are about 300-400 KB, which load fine. But when they go beyond 100 MB the process has a hard time (as you'd expect).
What happens is that the file is read and shoved into a RichTextBox which is then navigated - don't worry too much about this part.
The developer who wrote the initial code is simply using a StreamReader and doing [Reader].ReadToEnd(), which could take quite a while to complete.
My task is to break this bit of code up, read the file in chunks into a buffer, and show a progress bar with an option to cancel it.
Some assumptions:
- Most files will be 30-40 MB
- The contents of the file are text (not binary); some are Unix format, some are DOS.
- Once the contents are retrieved, we work out which terminator is used.
- No-one is concerned about the time it takes to render in the RichTextBox once it's loaded; it's just the initial load of the text.
Now for the questions:
- Can I simply use a StreamReader, check the Length property (for ProgressMax), issue a Read for a set buffer size, and iterate in a while loop inside a BackgroundWorker so it doesn't block the main UI thread, then return the StringBuilder to the main thread once it's completed?
- The contents will be going into a StringBuilder. Can I initialise the StringBuilder with the size of the stream if the length is available?
Are these (in your professional opinion) good ideas? I've had a few issues in the past with reading content from Streams, because it would always miss the last few bytes or something, but I'll ask another question if that turns out to be the case.
You can improve read speed by using a BufferedStream, like this:
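A minimal sketch of that (the 1 MB buffer size here is illustrative):

```csharp
using System.IO;

using (var fs = File.OpenRead(path))
using (var bs = new BufferedStream(fs, 1024 * 1024))
using (var reader = new StreamReader(bs))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // process each line here
    }
}
```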
March 2013 UPDATE
I recently wrote code for reading and processing (searching for text in) 1 GB-ish text files (much larger than the files involved here) and achieved a significant performance gain by using a producer/consumer pattern. The producer task read in lines of text using the BufferedStream and handed them off to a separate consumer task that did the searching.

I used this as an opportunity to learn TPL Dataflow, which is very well suited for quickly coding this pattern.
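A rough sketch of that producer/consumer shape with TPL Dataflow (the file path, search string, and buffer size are illustrative, not the original code):

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;  // TPL Dataflow NuGet package

static async Task SearchFileAsync(string path, string needle)
{
    // Consumer: searches each line, running concurrently with the reader.
    var consumer = new ActionBlock<string>(line =>
    {
        if (line.Contains(needle))
            Console.WriteLine(line);
    });

    // Producer: reads lines through a BufferedStream and posts them.
    using (var stream = File.OpenRead(path))
    using (var buffered = new BufferedStream(stream, 1024 * 1024))
    using (var reader = new StreamReader(buffered, Encoding.UTF8))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
            consumer.Post(line);
    }

    consumer.Complete();
    await consumer.Completion;  // wait for the consumer to drain
}
```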
Why BufferedStream is faster
A buffer is a block of bytes in memory used to cache data, thereby reducing the number of calls to the operating system. Buffers improve read and write performance.
December 2014 UPDATE: Your Mileage May Vary
Based on the comments, FileStream should be using a BufferedStream internally. At the time this answer was first provided, I measured a significant performance boost by adding a BufferedStream. At the time I was targeting .NET 3.x on a 32-bit platform. Today, targeting .NET 4.5 on a 64-bit platform, I do not see any improvement.
Related
I came across a case where streaming a large, generated CSV file to the Response stream from an ASP.Net MVC action was very slow. Adding a BufferedStream improved performance by 100x in this instance. For more see Unbuffered Output Very Slow
If you read the performance and benchmark stats on this website, you'll see that the fastest way to read (because reading, writing, and processing are all different) a text file is the following snippet of code:
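A minimal sketch of that kind of line-by-line read loop (the exact benchmarked snippet may differ):

```csharp
using (var reader = File.OpenText(path))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // process the line here
    }
}
```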
All up, about 9 different methods were benchmarked, but that one seems to come out ahead the majority of the time, even outperforming the buffered reader that other readers have mentioned.
You say you have been asked to show a progress bar while a large file is loading. Is that because the users genuinely want to see the exact % of file loading, or just because they want visual feedback that something is happening?
If the latter is true, then the solution becomes much simpler. Just do reader.ReadToEnd() on a background thread, and display a marquee-type progress bar instead of a proper one.

I raise this point because in my experience this is often the case. When you are writing a data processing program, then users will definitely be interested in a % complete figure, but for simple-but-slow UI updates, they are more likely to just want to know that the computer hasn't crashed. :-)
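A minimal WinForms sketch of that idea (control and variable names are hypothetical):

```csharp
// Marquee bar plus a background read, with the result marshalled
// back to the UI thread when it completes.
progressBar.Style = ProgressBarStyle.Marquee;

var worker = new BackgroundWorker();
worker.DoWork += (s, e) =>
{
    using (var reader = new StreamReader(filePath))
        e.Result = reader.ReadToEnd();
};
worker.RunWorkerCompleted += (s, e) =>
{
    progressBar.Style = ProgressBarStyle.Blocks;  // stop the marquee
    richTextBox.Text = (string)e.Result;          // runs on the UI thread
};
worker.RunWorkerAsync();
```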
Use a background worker and read only a limited number of lines. Read more only when the user scrolls.
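A minimal sketch of that paging idea (names and page size are hypothetical):

```csharp
using System.IO;
using System.Text;

const int LinesPerPage = 500;
StreamReader reader = new StreamReader(filePath);  // keep open between pages

void LoadNextPage()
{
    var sb = new StringBuilder();
    for (int i = 0; i < LinesPerPage; i++)
    {
        string line = reader.ReadLine();
        if (line == null) break;  // end of file
        sb.AppendLine(line);
    }
    richTextBox.AppendText(sb.ToString());  // call again when the user scrolls
}
```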
And try to never use ReadToEnd(). It's one of those functions that makes you think "why did they make it?"; it's a script kiddie's helper that works fine for small things, but as you see, it sucks for large files...
Those guys telling you to use StringBuilder need to read the MSDN more often:
Performance Considerations
The Concat and AppendFormat methods both concatenate new data to an existing String or StringBuilder object. A String object concatenation operation always creates a new object from the existing string and the new data. A StringBuilder object maintains a buffer to accommodate the concatenation of new data. New data is appended to the end of the buffer if room is available; otherwise, a new, larger buffer is allocated, data from the original buffer is copied to the new buffer, then the new data is appended to the new buffer.
The performance of a concatenation operation for a String or StringBuilder object depends on how often a memory allocation occurs.
A String concatenation operation always allocates memory, whereas a StringBuilder concatenation operation only allocates memory if the StringBuilder object buffer is too small to accommodate the new data. Consequently, the String class is preferable for a concatenation operation if a fixed number of String objects are concatenated. In that case, the individual concatenation operations might even be combined into a single operation by the compiler. A StringBuilder object is preferable for a concatenation operation if an arbitrary number of strings are concatenated; for example, if a loop concatenates a random number of strings of user input.
That means huge memory allocations, which leads to heavy use of the swap file system, which simulates sections of your hard disk drive acting like RAM memory; but a hard disk drive is very slow.

The StringBuilder option looks fine when the system is used by a single user, but when you have two or more users reading large files at the same time, you have a problem.
For binary files, the fastest way of reading them that I have found is this:
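A minimal BinaryReader sketch along those lines (assumes the file fits in a single byte array, i.e. under 2 GB):

```csharp
using System.IO;

byte[] fileData;
using (var fs = File.OpenRead(path))
using (var binary = new BinaryReader(fs))
{
    // one call pulls the entire file into memory
    fileData = binary.ReadBytes((int)fs.Length);
}
```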
In my tests it's hundreds of times faster.
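A sketch of the chunked approach the OP describes (hypothetical WinForms names; the BackgroundWorker needs WorkerReportsProgress and WorkerSupportsCancellation enabled):

```csharp
using System.ComponentModel;
using System.IO;
using System.Text;

void worker_DoWork(object sender, DoWorkEventArgs e)
{
    var worker = (BackgroundWorker)sender;
    var sb = new StringBuilder();
    using (var reader = new StreamReader(filePath))
    {
        long length = reader.BaseStream.Length;
        sb.EnsureCapacity((int)length);  // the OP's question 2: pre-size the builder
        var buffer = new char[8192];
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            if (worker.CancellationPending) { e.Cancel = true; return; }
            sb.Append(buffer, 0, read);
            worker.ReportProgress((int)(reader.BaseStream.Position * 100 / length));
        }
    }
    e.Result = sb;  // handed back to the UI thread in RunWorkerCompleted
}
```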
This should be enough to get you started.
Have a look at the following code snippet. You have mentioned "Most files will be 30-40 MB". This claims to read 180 MB in 1.4 seconds on an Intel Quad Core (Original Article):
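In the spirit of that approach, a sketch with a large FileStream buffer and a sequential-scan hint (the sizes are illustrative, not the article's exact code):

```csharp
using System.IO;

string text;
using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                               FileShare.Read, 1 << 20,  // 1 MB buffer
                               FileOptions.SequentialScan))
using (var reader = new StreamReader(fs))
{
    text = reader.ReadToEnd();
}
```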
All excellent answers! However, for someone looking for an answer, these appear to be somewhat incomplete.
As a standard String can only be of size X (2 GB to 4 GB depending on your configuration), these answers do not really fulfil the OP's question. One method is to work with a List of Strings:
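A minimal sketch of that approach (the path is illustrative):

```csharp
using System.Collections.Generic;
using System.IO;

// Stream the file line by line so no single string ever holds the whole file.
var lines = new List<string>();
using (var reader = new StreamReader(path))
{
    string line;
    while ((line = reader.ReadLine()) != null)
        lines.Add(line);
}
```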
Some may want to tokenise and split the lines when processing. The list of strings can now contain very large volumes of text.
Whilst the most upvoted answer is correct, it lacks usage of multi-core processing. In my case, having 12 cores, I use PLINQ:
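A sketch of that shape (the search predicate and degree of parallelism are illustrative):

```csharp
using System.IO;
using System.Linq;

// File.ReadLines streams lazily; AsParallel spreads the per-line work
// across cores.
var matches = File.ReadLines(path)
    .AsParallel()
    .WithDegreeOfParallelism(12)  // matches the 12 cores mentioned above
    .Where(line => line.Contains("searchTerm"))
    .ToList();
```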
Worth mentioning: I got this as an interview question asking to return the top 10 most frequent occurrences:
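A sketch of that top-10 variant (the word splitting is deliberately naive):

```csharp
using System.IO;
using System.Linq;

// Count word frequencies in parallel and take the ten largest.
var top10 = File.ReadLines(path)
    .AsParallel()
    .SelectMany(line => line.Split(' '))
    .GroupBy(word => word)
    .OrderByDescending(g => g.Count())
    .Take(10)
    .Select(g => new { Word = g.Key, Count = g.Count() })
    .ToList();
```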
Benchmarking:
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
[Host] : .NET Framework 4.8 (4.8.4250.0), X64 RyuJIT
DefaultJob : .NET Framework 4.8 (4.8.4250.0), X64 RyuJIT
And as you can see, it's a 75% performance improvement.
But please note that the 7 GB file is loaded into memory all at once, and since it's a single blob it puts a lot of pressure on the GC.
You might be better off using memory-mapped file handling here. Memory-mapped file support will be around in .NET 4 (I think... I heard that through someone else talking about it), hence this wrapper, which uses P/Invoke to do the same job.

Edit: See here on MSDN for how it works, and here's the blog entry indicating how it will be done in .NET 4 when it comes out as a release. The link I have given earlier is a wrapper around the P/Invoke approach to achieve this. You can map the entire file into memory and view it like a sliding window when scrolling through the file.
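For reference, a sketch using the .NET 4 System.IO.MemoryMappedFiles API this answer anticipates (the window offset and size are illustrative, and the window must lie within the file):

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

// Map the file, then read it through view streams acting as a sliding window.
using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
using (var window = mmf.CreateViewStream(0, 64 * 1024))  // 64 KB at offset 0
using (var reader = new StreamReader(window))
{
    string chunk = reader.ReadToEnd();
    // create further views at other offsets as the user scrolls
}
```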
It's been more than 10 years since the last answers. This is my solution for reading text files of more than 10 GB and returning the result at your required length. Putting it here in case anyone is seeking help :)
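A sketch of that idea (the signature is hypothetical; offsets are counted in characters so multi-byte encodings stay aligned):

```csharp
using System;
using System.IO;

// Return `length` characters starting at a character offset,
// without ever loading the whole file.
static string ReadChunk(string path, long charOffset, int length)
{
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
    using (var reader = new StreamReader(fs))
    {
        // Skip `charOffset` characters.
        var skip = new char[8192];
        long toSkip = charOffset;
        while (toSkip > 0)
        {
            int read = reader.Read(skip, 0, (int)Math.Min(skip.Length, toSkip));
            if (read <= 0) return string.Empty;  // offset is past end of file
            toSkip -= read;
        }

        // Read the requested slice; Read may return short counts, so loop.
        var buffer = new char[length];
        int total = 0;
        while (total < length)
        {
            int read = reader.Read(buffer, total, length - total);
            if (read <= 0) break;  // end of file
            total += read;
        }
        return new string(buffer, 0, total);
    }
}
```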
An iterator might be perfect for this type of work:
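A minimal sketch of such an iterator (names and buffer size are illustrative):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Reads the file in chunks, appending to the caller's StringBuilder and
// yielding a 0-100 progress value after each chunk.
public static IEnumerable<int> LoadFileWithProgress(string filename, StringBuilder result)
{
    const int bufferSize = 32768;
    using (var stream = new FileStream(filename, FileMode.Open, FileAccess.Read))
    using (var reader = new BinaryReader(stream, Encoding.UTF8))
    {
        var buffer = new char[bufferSize];
        int charsRead;
        // BinaryReader.Read(char[], ...) decodes whole characters, so
        // multi-byte sequences never split across chunks.
        while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            result.Append(buffer, 0, charsRead);
            yield return (int)(stream.Position * 100 / stream.Length);
        }
    }
}
```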
You can call it using the following:
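For example (control names are hypothetical; the loop runs entirely on the UI thread, matching the no-threads point below):

```csharp
var sb = new StringBuilder();
foreach (int progress in LoadFileWithProgress(fileName, sb))
{
    progressBar.Value = progress;  // update the progress bar
    Application.DoEvents();        // keep the single-threaded UI responsive
}
// sb now holds the entire file
```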
As the file is loaded, the iterator will return the progress number from 0 to 100, which you can use to update your progress bar. Once the loop has finished, the StringBuilder will contain the contents of the text file.
Also, because you want text, we can just use BinaryReader to read in characters, which will ensure that your buffers line up correctly when reading any multi-byte characters (UTF-8, UTF-16, etc.).
This is all done without using background tasks, threads, or complex custom state machines.
My file is over 13 GB:
The below link contains code that reads a piece of the file easily:
Read a large text file
More information