How do I read a large (1 GB) txt file in .NET?

Posted on 2024-10-04 11:52:51


I have a 1 GB text file which I need to read line by line. What is the best and fastest way to do this?

private void ReadTxtFile()
{
    string filePath = openFileDialog1.FileName;
    if (!string.IsNullOrEmpty(filePath))
    {
        using (StreamReader sr = new StreamReader(filePath))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                FormatData(line);
            }
        }
    }
}

In FormatData() I check whether the line starts with a particular word and, if it does, increment an integer variable.

void FormatData(string line)
{
    if (line.StartsWith(word))
    {
        globalIntVariable++;
    }
}

Comments (9)

默嘫て 2024-10-11 11:52:51


If you are using .NET 4.0, try MemoryMappedFile, which is a class designed for this scenario.

Otherwise, you can use StreamReader.ReadLine.
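For reference, a minimal sketch of the MemoryMappedFile route might look like the following: the memory-mapped view stream is wrapped in a StreamReader so the file can still be consumed line by line. The file path and the matched word are placeholders, not from the question.

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class Program
{
    static void Main()
    {
        string filePath = @"C:\data\large.txt"; // placeholder path
        string word = "ERROR";                  // placeholder word to match
        int count = 0;

        using (var mmf = MemoryMappedFile.CreateFromFile(filePath, FileMode.Open))
        using (var stream = mmf.CreateViewStream()) // maps the whole file
        using (var reader = new StreamReader(stream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Note: the mapped view may be rounded up to a page boundary,
                // so the last "line" can carry trailing '\0' bytes.
                if (line.StartsWith(word))
                    count++;
            }
        }

        Console.WriteLine(count);
    }
}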

窝囊感情。 2024-10-11 11:52:51


Using StreamReader is probably the way to go, since you don't want the whole file in memory at once. MemoryMappedFile is more suited to random access than to sequential reading (a plain stream is roughly ten times as fast for sequential reading, while memory mapping is roughly ten times as fast for random access).

You might also try creating your StreamReader from a FileStream with FileOptions set to SequentialScan (see the FileOptions enumeration), but I doubt it will make much of a difference.

There are, however, ways to make your example more efficient, since you do your formatting in the same loop as the reading. You're wasting clock cycles, so if you want even more performance, a multithreaded asynchronous solution would be better: one thread reads the data and another formats it as it becomes available. Check out BlockingCollection, which might fit your needs:

Blocking Collection and the Producer-Consumer Problem

If you want the fastest possible performance, in my experience the only way is to sequentially read in as large a chunk of binary data as possible and deserialize it into text in parallel, but the code starts to get complicated at that point.
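As a rough sketch of the producer-consumer idea (not the answerer's actual code): one task reads lines through a FileStream opened with FileOptions.SequentialScan and pushes them into a BlockingCollection, while the main thread consumes and counts them. The path and the matched word are placeholders.

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        string filePath = @"C:\data\large.txt"; // placeholder path
        string word = "ERROR";                  // placeholder word to match
        int count = 0;

        // Bounded queue so the reader can't run far ahead of the consumer.
        using (var lines = new BlockingCollection<string>(boundedCapacity: 10000))
        {
            // Producer: reads lines sequentially, hinting the OS with SequentialScan.
            var producer = Task.Run(() =>
            {
                try
                {
                    var fs = new FileStream(filePath, FileMode.Open, FileAccess.Read,
                                            FileShare.Read, 1 << 16, FileOptions.SequentialScan);
                    using (var reader = new StreamReader(fs))
                    {
                        string line;
                        while ((line = reader.ReadLine()) != null)
                            lines.Add(line);
                    }
                }
                finally
                {
                    lines.CompleteAdding(); // unblock the consumer even on error
                }
            });

            // Consumer: formats/counts lines as they become available.
            foreach (var line in lines.GetConsumingEnumerable())
            {
                if (line.StartsWith(word))
                    count++;
            }

            producer.Wait(); // propagate any reader exceptions
        }

        Console.WriteLine(count);
    }
}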

护你周全 2024-10-11 11:52:51


You can use LINQ:

int result = File.ReadLines(filePath).Count(line => line.StartsWith(word));

File.ReadLines returns an IEnumerable<String> that lazily reads each line from the file without loading the whole file into memory.

Enumerable.Count counts the lines that start with the word.

If you are calling this from a UI thread, use a BackgroundWorker.
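A minimal sketch of that last point might look like the following: a BackgroundWorker runs the LINQ count off the UI thread and reports back when done. The form, label, path, and word are all made up for illustration.

using System;
using System.ComponentModel;
using System.IO;
using System.Linq;
using System.Windows.Forms;

public class MainForm : Form
{
    private readonly Label countLabel = new Label { Dock = DockStyle.Fill, Text = "Counting..." };

    public MainForm()
    {
        Controls.Add(countLabel);

        var worker = new BackgroundWorker();
        worker.DoWork += (s, e) =>
        {
            // Runs on a thread-pool thread, so the UI stays responsive.
            e.Result = File.ReadLines(@"C:\data\large.txt")          // placeholder path
                           .Count(line => line.StartsWith("ERROR")); // placeholder word
        };
        worker.RunWorkerCompleted += (s, e) =>
        {
            // Raised back on the UI thread when the count is done.
            countLabel.Text = e.Error != null ? e.Error.Message : e.Result.ToString();
        };
        Load += (s, e) => worker.RunWorkerAsync();
    }

    [STAThread]
    static void Main() => Application.Run(new MainForm());
}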

玩套路吗 2024-10-11 11:52:51


Probably best to read it line by line.

You should not try to force the whole file into memory by reading to the end and then processing it.

西瑶 2024-10-11 11:52:51


StreamReader.ReadLine should work fine. Let the framework choose the buffering, unless profiling shows that you can do better.

红衣飘飘貌似仙 2024-10-11 11:52:51


I was facing the same problem on our production server at Agenty, where we see large files (sometimes 10-25 GB tab-delimited (\t) txt files). After a lot of testing and research, I found that the best way to read large files in small chunks is a for/foreach loop with offset-and-limit logic using File.ReadLines().

int TotalRows = File.ReadLines(Path).Count(); // Lazily count the rows without loading the whole file
int Limit = 100000; // 100,000 rows per batch
for (int Offset = 0; Offset < TotalRows; Offset += Limit)
{
    // FileToTable/TableToFile are extension methods from the Agenty
    // FileReader library linked below, not part of the BCL.
    var table = Path.FileToTable(heading: true, delimiter: '\t', offset: Offset, limit: Limit);

    // Do all your processing here with limit and offset, then save to disk in
    // append mode so each processed batch is written to the same output file.
    table.TableToFile(@"C:\output.txt");
}

See the complete code in my GitHub library: https://github.com/Agenty/FileReader/

Full disclosure: I work for Agenty, the company that owns this library and website.
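For comparison, the same offset/limit idea can be approximated with nothing but the BCL, using File.ReadLines plus Skip/Take. This sketch uses placeholder paths and a made-up Process step; note that Skip still enumerates (and discards) the skipped lines on every batch, which the library above avoids by tracking a real file offset.

using System;
using System.IO;
using System.Linq;

class Program
{
    static void Main()
    {
        string path = @"C:\data\large.tsv"; // placeholder tab-delimited file
        int limit = 100000;                 // rows per batch
        int totalRows = File.ReadLines(path).Count(); // lazy count, no full load

        for (int offset = 0; offset < totalRows; offset += limit)
        {
            // Take one batch of rows; earlier rows are re-scanned each time,
            // so this is O(n^2) over the whole file in the worst case.
            var batch = File.ReadLines(path).Skip(offset).Take(limit);

            // Append each processed batch to the same output file.
            File.AppendAllLines(@"C:\output.txt", batch.Select(Process));
        }
    }

    // Hypothetical per-row processing step.
    static string Process(string row) => row.ToUpperInvariant();
}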

〃温暖了心ぐ 2024-10-11 11:52:51


My file is over 13 GB:


You can use my class:

// Requires: using System.IO.MemoryMappedFiles; using System.Text;
public static string Read(int length)
{
    StringBuilder resultAsString = new StringBuilder();

    using (MemoryMappedFile memoryMappedFile = MemoryMappedFile.CreateFromFile(@"D:\_Profession\Projects\Parto\HotelDataManagement\_Document\Expedia_Rapid.jsonl\Expedia_Rapi.json"))
    using (MemoryMappedViewStream memoryMappedViewStream = memoryMappedFile.CreateViewStream(0, length))
    {
        for (int i = 0; i < length; i++)
        {
            // Reads a byte from the stream and advances the position by one,
            // or returns -1 at the end of the stream.
            int result = memoryMappedViewStream.ReadByte();

            if (result == -1)
            {
                break;
            }

            // Note: casting a single byte to char is only correct for
            // single-byte encodings such as ASCII.
            char letter = (char)result;

            resultAsString.Append(letter);
        }
    }

    return resultAsString.ToString();
}

This code reads the file's text from the start up to the length that you pass to the method Read(int length), collects it in the resultAsString variable, and returns it.

猫性小仙女 2024-10-11 11:52:51


I'd read the file 10,000 bytes at a time. Then I'd analyze those 10,000 bytes, chop them into lines, and feed them to the FormatData function; a minimal sketch follows below.

Bonus points for splitting the reading and the line analysis across multiple threads.

I'd definitely use a StringBuilder to collect all the strings, and might build a string buffer to keep about 100 strings in memory at all times.
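Here is one way that chunked approach might look; the tricky part is carrying a partial line across chunk boundaries. The path and the matched word are placeholders, and the sketch assumes a single-byte encoding (ASCII) so a chunk boundary cannot split a character.

using System;
using System.IO;
using System.Text;

class Program
{
    static int count;
    static readonly string word = "ERROR"; // placeholder word to match

    static void Main()
    {
        var buffer = new byte[10000];
        var carry = new StringBuilder(); // holds a partial line across chunks

        using (var fs = File.OpenRead(@"C:\data\large.txt")) // placeholder path
        {
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                // A multi-byte encoding like UTF-8 could be split mid-character
                // at a chunk boundary and would need a Decoder instead.
                string chunk = Encoding.ASCII.GetString(buffer, 0, read);

                int start = 0, nl;
                while ((nl = chunk.IndexOf('\n', start)) >= 0)
                {
                    // Complete the pending line and hand it off.
                    carry.Append(chunk, start, nl - start);
                    FormatData(carry.ToString().TrimEnd('\r'));
                    carry.Clear();
                    start = nl + 1;
                }

                // Stash the trailing partial line for the next chunk.
                carry.Append(chunk, start, chunk.Length - start);
            }

            if (carry.Length > 0)
                FormatData(carry.ToString()); // last line without a trailing newline
        }

        Console.WriteLine(count);
    }

    static void FormatData(string line)
    {
        if (line.StartsWith(word))
            count++;
    }
}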
