Parsing a text file in C# and skipping some content

Posted on 2024-07-19 05:29:29 · 374 words · 4 views · 0 comments

I'm trying to parse a text file that has a header and a body. The header contains line-number references to sections of the body. For example:

SECTION_A 256
SECTION_B 344
SECTION_C 556

This means that SECTION_A starts at line 256.

What would be the best way to parse this header into a dictionary and then read the sections when necessary?

Typical scenarios would be:

  1. Parse the header and read only section SECTION_B
  2. Parse the header and read the first paragraph of each section.

The data file is quite large and I definitely don't want to load all of it into memory and then operate on it.

I'd appreciate your suggestions. My environment is VS 2008 and C# 3.5 SP1.

Comments (5)

吃素的狼 2024-07-26 05:29:29

You can do this quite easily.

There are three parts to the problem.

1) How to find where a line in the file starts. The only way to do this is to read through the file once, keeping a list that records the byte offset at which each line starts. For example:

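List<long> lineMap = new List<long>();
lineMap.Add(0);    // Line 0: dummy entry so that line numbers are 1-based
lineMap.Add(0);    // Line 1 starts at offset 0 in the data file

// StreamReader buffers its input, so sr.BaseStream.Position does not point at
// the start of the next line. Scan the raw bytes instead and record the offset
// just after each '\n' (valid for ASCII/UTF-8 files with \n or \r\n endings).
using (FileStream fs = File.OpenRead("DataFile.txt"))
using (BufferedStream bs = new BufferedStream(fs))
{
    long position = 0;
    int b;
    while ((b = bs.ReadByte()) != -1)
    {
        position++;
        if (b == '\n')
            lineMap.Add(position);  // The next line starts right after the newline
    }
}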

2) Read and parse your index file into a dictionary.

Dictionary<string, int> index = new Dictionary<string, int>();

using (StreamReader sr = new StreamReader("IndexFile.txt")) 
{
    String line;
    while ((line = sr.ReadLine()) != null)
    {
        string[] parts = line.Split(' ');  // Break the line into the name & line number
        index.Add(parts[0], Convert.ToInt32(parts[1]));
    }
}

Then to find a line in your file, use:

int lineNumber = index["SECTION_B"];          // Convert section name into the line number
long offsetInDataFile = lineMap[lineNumber];  // Convert line number into file offset

Then open a new FileStream on DataFile.txt, Seek(offsetInDataFile, SeekOrigin.Begin) to move to the start of the line, and use a StreamReader (as above) to read line(s) from it.
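
The seek-and-read step can be sketched as follows. This is a minimal sketch, not a definitive implementation: `SectionReader` and `ReadLines` are hypothetical names, `lineMap` is assumed to hold the byte offset of each 1-based line as described above, and the file is assumed to be ASCII or UTF-8.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class SectionReader
{
    // Hypothetical helper: returns lines startLine..endLine-1 of the file,
    // where lineMap[n] is the byte offset at which line n starts.
    public static List<string> ReadLines(string path, List<long> lineMap,
                                         int startLine, int endLine)
    {
        List<string> lines = new List<string>();
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            // Seek before creating the reader, so that its buffer starts
            // filling at the beginning of the requested line.
            fs.Seek(lineMap[startLine], SeekOrigin.Begin);
            using (StreamReader sr = new StreamReader(fs))
            {
                string line;
                int n = startLine;
                while (n < endLine && (line = sr.ReadLine()) != null)
                {
                    lines.Add(line);
                    n++;
                }
            }
        }
        return lines;
    }
}
```

Only the requested lines are held in memory; everything before the offset is skipped by the seek rather than read.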

云仙小弟 2024-07-26 05:29:29

Well, obviously you can store the name + line number into a dictionary, but that's not going to do you any good.

Well, sure, it will allow you to know which line to start reading from, but the problem is, where in the file is that line? The only way to know is to start from the beginning and start counting.

The best way would be to write a wrapper that decodes the text contents (if you have encoding issues) and gives you a mapping from line number to byte position. You could then take a line number such as 256, look it up in a dictionary to learn that line 256 starts at, say, position 10000 in the file, and start reading from there.

Is this a one-off processing situation? If not, have you considered stuffing the entire file into a local database, like a SQLite database? That would allow you to have a direct mapping between line number and its contents. Of course, that file would be even bigger than your original file, and you'd need to copy data from the text file to the database, so there's some overhead either way.

划一舟意中人 2024-07-26 05:29:29

Just read the file one line at a time and ignore the data until you get to the ones you need. You won't have any memory issues, but performance probably won't be great. You can do this easily in a background thread though.
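
A minimal sketch of this skip-ahead approach, assuming 1-based line numbers that count from the top of the file; `SequentialReader` and `ReadRange` are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class SequentialReader
{
    // Hypothetical helper: returns lines startLine..endLine-1 (1-based).
    // Lines before startLine are read and discarded, so only the requested
    // range is ever held in memory.
    public static List<string> ReadRange(string path, int startLine, int endLine)
    {
        List<string> result = new List<string>();
        using (StreamReader sr = new StreamReader(path))
        {
            string line;
            int lineNumber = 0;
            while ((line = sr.ReadLine()) != null)
            {
                lineNumber++;
                if (lineNumber < startLine)
                    continue;           // not there yet: ignore this line
                if (lineNumber >= endLine)
                    break;              // past the requested range: stop reading
                result.Add(line);
            }
        }
        return result;
    }
}
```

The cost is one pass over everything that precedes the section, which is why performance degrades for sections near the end of a large file.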

追星践月 2024-07-26 05:29:29

Read the file until the end of the header, assuming you know where that is. Split the strings you've stored on whitespace, like so:

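Dictionary<string, int> sectionIndex = new Dictionary<string, int>();
List<string> headers = new List<string>(); // fill this with the header lines from ReadLine

foreach (string header in headers)
{
    // "SECTION_B 344" -> { "SECTION_B", "344" }; repeated spaces are ignored
    string[] s = header.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    sectionIndex.Add(s[0], Int32.Parse(s[1]));
}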

Find the dictionary entry you want, keep a count of the number of lines read in the file, and loop until you hit that line number, then read until you reach the next section's starting line. A Dictionary does not guarantee the order of its keys, so you can't rely on the next entry being the next section; you'd need to look up the next section's start line explicitly, for example by comparing line numbers.

Be sure to do some error checking to make sure the section you're reading to isn't before the section you're reading from, and any other error cases you can think of.
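
One way to find where a section ends without relying on dictionary order is to search for the smallest start line that is greater than the section's own. This is a sketch under the assumption that sections do not overlap; `SectionRanges` and `FindEndLine` are hypothetical names:

```csharp
using System;
using System.Collections.Generic;

static class SectionRanges
{
    // Hypothetical helper: returns the line at which the named section ends
    // (exclusive), i.e. the start line of the section that follows it in the
    // file, or int.MaxValue if it is the last section.
    public static int FindEndLine(Dictionary<string, int> sectionIndex, string name)
    {
        int start;
        if (!sectionIndex.TryGetValue(name, out start))
            throw new ArgumentException("Unknown section: " + name);

        int end = int.MaxValue;
        foreach (KeyValuePair<string, int> entry in sectionIndex)
        {
            // Keep the closest start line that lies after this section's start.
            if (entry.Value > start && entry.Value < end)
                end = entry.Value;
        }
        return end;
    }
}
```

Because the end line is always chosen from the start lines greater than the section's own, it can never fall before the section's start, which covers the first error case mentioned above.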

无风消散 2024-07-26 05:29:29

You could read line by line until all the heading information is captured and stop (assuming all section pointers are in the heading). You would have the section and line numbers for use in retrieving the data at a later time.

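try
{
    // 'using' closes the reader even if an exception is thrown.
    using (TextReader tr = new StreamReader("filename.txt"))
    {
        string dataRow;

        // Stop at end of file (ReadLine returns null) or at the first line
        // that is not a header entry. StartsWith replaces Substring(1, 8),
        // which skipped the first character and threw on short lines.
        while ((dataRow = tr.ReadLine()) != null
               && dataRow.StartsWith("SECTION_", StringComparison.Ordinal))
        {
            // Parse line for section code and line number and log values
        }
    }
}
catch (Exception ex)
{
    MessageBox.Show(ex.Message);
}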
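
The parsing step left as a comment above could look roughly like this. It is a sketch that assumes every header line has the form `SECTION_X <line number>` separated by spaces; `HeaderParser` and `Parse` are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class HeaderParser
{
    // Hypothetical helper: reads header lines from the reader until the first
    // line that does not start with "SECTION_" (that line is consumed too).
    public static Dictionary<string, int> Parse(TextReader reader)
    {
        Dictionary<string, int> sections = new Dictionary<string, int>();
        string line;
        while ((line = reader.ReadLine()) != null
               && line.StartsWith("SECTION_", StringComparison.Ordinal))
        {
            string[] parts = line.Split(new char[] { ' ' },
                                        StringSplitOptions.RemoveEmptyEntries);
            int lineNumber;
            if (parts.Length != 2 || !int.TryParse(parts[1], out lineNumber))
                throw new FormatException("Malformed header line: " + line);

            sections[parts[0]] = lineNumber;   // section name -> start line
        }
        return sections;
    }
}
```

Taking a `TextReader` rather than a file name means the same method works with a `StreamReader` over the data file or a `StringReader` in a test.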