Need to get the line terminator when using StreamReader.ReadLine()

Posted on 2024-07-14 23:29:37

I wrote a C# program to read an Excel .xls/.xlsx file and output it as CSV and Unicode text. I wrote a separate program to remove blank records. It reads each line with StreamReader.ReadLine(), then goes character by character through the string and skips the line if it consists entirely of commas (for the CSV) or entirely of tabs (for the Unicode text).

The problem occurs when the Excel file contains embedded newlines (\x0A) inside the cells. I changed my XLS to CSV converter to find these new lines (since it goes cell by cell) and write them as \x0A, and normal lines just use StreamWriter.WriteLine().

The problem occurs in the separate program to remove blank records. When I read in with StreamReader.ReadLine(), by definition it only returns the string with the line, not the terminator. Since the embedded newlines show up as two separate lines, I can't tell which is a full record and which is an embedded newline for when I write them to the final file.

I'm not even sure I can read in the \x0A because everything on the input registers as '\n'. I could go character by character, but this destroys my logic to remove blank lines.

Comments (5)

記憶穿過時間隧道 2024-07-21 23:29:37

I would recommend that you change your architecture to work more like a parser in a compiler.

You want to create a lexer that returns a sequence of tokens, and then a parser that reads the sequence of tokens and does stuff with them.

In your case the tokens would be:

  1. Column data
  2. Comma
  3. End of Line

You would treat '\n' ('\x0a') by itself as an embedded new line, and therefore include it as part of a column data token. A '\r\n' would constitute an End of Line token.

This has the advantages of:

  1. Doing only one pass over the data
  2. Storing at most one line's worth of data
  3. Reusing as much memory as possible (for the string builder and the list)
  4. Being easy to change should your requirements change

Here's a sample of what the Lexer would look like:

Disclaimer: I haven't even compiled, let alone tested, this code, so you'll need to clean it up and make sure it works.

using System.Collections.Generic;
using System.IO;
using System.Text;

enum TokenType
{
    ColumnData,
    Comma,
    LineTerminator
}

class Token
{
    public TokenType Type { get; private set; }
    public string Data { get; private set; }

    public Token(TokenType type)
    {
        Type = type;
    }

    public Token(TokenType type, string data)
    {
        Type = type;
        Data = data;
    }
}

// The two methods below belong inside your converter class.
private IEnumerable<Token> GetTokens(TextReader s)
{
    var builder = new StringBuilder();

    while (s.Peek() >= 0)
    {
        var c = (char)s.Read();
        switch (c)
        {
            case ',':
            {
                // Flush any pending column text before the separator.
                if (builder.Length > 0)
                {
                    yield return new Token(TokenType.ColumnData, ExtractText(builder));
                }
                yield return new Token(TokenType.Comma);
                break;
            }
            case '\r':
            {
                // "\r\n" (or a lone '\r') ends the record.
                var next = s.Peek();
                if (next == '\n')
                {
                    s.Read();
                }

                if (builder.Length > 0)
                {
                    yield return new Token(TokenType.ColumnData, ExtractText(builder));
                }
                yield return new Token(TokenType.LineTerminator);
                break;
            }
            default:
                // Everything else, including a lone '\n', is column data.
                builder.Append(c);
                break;
        }
    }

    // Flush the last column if the input does not end with a line terminator.
    if (builder.Length > 0)
    {
        yield return new Token(TokenType.ColumnData, ExtractText(builder));
    }
}

private string ExtractText(StringBuilder b)
{
    // Return the accumulated text and empty the builder for reuse.
    var ret = b.ToString();
    b.Remove(0, b.Length);
    return ret;
}

Your "parser" code would then look like this:

// OutputLine(List<string>) is not shown here; it writes one record to the output.
public void ConvertXLS(TextReader s)
{
    var columnData = new List<string>();
    bool lastWasColumnData = false;
    bool seenAnyData = false;

    foreach (var token in GetTokens(s))
    {
        switch (token.Type)
        {
            case TokenType.ColumnData:
            {
                seenAnyData = true;
                if (lastWasColumnData)
                {
                    //TODO: do some error reporting
                }
                else
                {
                    lastWasColumnData = true;
                    columnData.Add(token.Data);
                }
                break;
            }
            case TokenType.Comma:
            {
                // A comma that does not follow column data marks an empty column.
                if (!lastWasColumnData)
                {
                    columnData.Add(null);
                }
                lastWasColumnData = false;
                break;
            }
            case TokenType.LineTerminator:
            {
                // Records with no column data (blank records) are skipped.
                if (seenAnyData)
                {
                    OutputLine(columnData);
                }
                seenAnyData = false;
                lastWasColumnData = false;
                columnData.Clear();
                break;
            }
        }
    }

    // Write the final record if the input did not end with a line terminator.
    if (seenAnyData)
    {
        OutputLine(columnData);
    }
}
夜巴黎 2024-07-21 23:29:37

You can't change StreamReader to return the line terminators, and you can't change what it uses for line termination.

I'm not entirely clear about the problem in terms of what escaping you're doing, particularly in terms of "and write them as \x0A". A sample of the file would probably help.

It sounds like you may need to work character by character, or possibly load the whole file first and do a global replace, e.g.

x.Replace("\r\n", "\u0000") // Or some other unused character
 .Replace("\n", "\\x0A") // Or whatever escaping you need
 .Replace("\u0000", "\r\n") // Replace the real line breaks

I'm sure you could do that with a regex and it would probably be more efficient, but I find the long way easier to understand :) It's a bit of a hack having to do a global replace though - hopefully with more information we'll come up with a better solution.
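As a rough illustration of the character-by-character route, here is a minimal sketch of a helper that returns each line together with the terminator that ended it. `LineReader` and `ReadLineWithTerminator` are hypothetical names made up for this example, not part of the .NET API, and the sketch assumes a reader where `Peek()` is reliable (a file or a string).

```csharp
using System;
using System.IO;
using System.Text;

static class LineReader
{
    // Hypothetical helper: reads one line and reports which terminator ended it:
    // "\r\n", "\n", "\r", or "" when the input ended without a terminator.
    // Returns null when there is nothing left to read.
    public static (string Text, string Terminator)? ReadLineWithTerminator(TextReader reader)
    {
        int c = reader.Read();
        if (c < 0) return null;

        var text = new StringBuilder();
        while (c >= 0)
        {
            if (c == '\n') return (text.ToString(), "\n");
            if (c == '\r')
            {
                if (reader.Peek() == '\n')
                {
                    reader.Read(); // consume the '\n' of the CRLF pair
                    return (text.ToString(), "\r\n");
                }
                return (text.ToString(), "\r");
            }
            text.Append((char)c);
            c = reader.Read();
        }
        return (text.ToString(), "");
    }
}
```

The caller can then decide per line whether to treat the break as the end of a record (CRLF) or as an embedded newline (lone LF), and write the same terminator back with `Write()` instead of `WriteLine()`.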

一指流沙 2024-07-21 23:29:37

Essentially, a hard-return in Excel (shift+enter or alt+enter, I can't remember) puts a newline that is equivalent to \x0A in the default encoding I use to write my CSV. When I write to CSV, I use StreamWriter.WriteLine(), which outputs the line plus a newline (which I believe is \r\n).

The CSV is fine and comes out exactly how Excel would save it. The problem is that when I read it into the blank record remover, I'm using ReadLine(), which treats an embedded newline the same as a CRLF and ends the line there.

Here's an example of the file after I convert to CSV...

Reference,Name of Individual or Entity,Type,Name Type,Date of Birth,Place of Birth,Citizenship,Address,Additional Information,Listing Information,Control Date,Committees
1050,"Aziz Salih al-Numan
",Individual,Primary Name,1941 or 1945,An Nasiriyah,Iraqi,,Ba’th Party Regional Command Chairman; Former Governor of Karbala and An Najaf Former Minister of Agriculture and Agrarian Reform (1986-1987),Resolution 1483 (2003),6/27/2003,1518 (Iraq)
1050a,???? ???? ???????,Individual,Original script,1941 or 1945,An Nasiriyah,Iraqi,,Ba’th Party Regional Command Chairman; Former Governor of Karbala and An Najaf Former Minister of Agriculture and Agrarian Reform (1986-1987),Resolution 1483 (2003),6/27/2003,1518 (Iraq)

As you can see, the first record has an embedded new-line after al-Numan. When I use ReadLine(), I get '1050,"Aziz Salih al-Numan' and when I write that out, WriteLine() ends that line with a CRLF. I lose the original line terminator. When I use ReadLine() again, I get the line starting with '1050a'.

I could read the entire file in and replace them, but then I'd have to replace them back afterwards. Basically what I want to do is get the line terminator to determine if it's \x0A or a CRLF, and then if it's \x0A, I'll use Write() and insert that terminator.
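One way this could look, as a sketch only: read character by character, treat CRLF as the end of a record, and keep a lone \x0A inside the record so it is written back unchanged. `BlankRecordRemover` and `CopyNonBlankRecords` are hypothetical names for this example, and the sketch assumes that real record breaks are always CRLF and that a blank record consists only of separator characters.

```csharp
using System;
using System.IO;
using System.Text;

static class BlankRecordRemover
{
    // Copies records from input to output, dropping records that are empty or
    // made up only of the separator (',' for CSV, '\t' for tab-delimited text).
    // Assumption: a record ends at CRLF; a lone '\n' is an embedded newline.
    public static void CopyNonBlankRecords(TextReader input, TextWriter output, char separator)
    {
        var record = new StringBuilder();
        int c;
        while ((c = input.Read()) >= 0)
        {
            if (c == '\r' && input.Peek() == '\n')
            {
                input.Read(); // consume the '\n' of the CRLF pair
                WriteIfNotBlank(record, output, separator);
                record.Clear();
            }
            else
            {
                record.Append((char)c); // keeps any lone '\n' inside the record
            }
        }
        // Handle a final record that has no trailing CRLF.
        WriteIfNotBlank(record, output, separator);
    }

    private static void WriteIfNotBlank(StringBuilder record, TextWriter output, char separator)
    {
        for (int i = 0; i < record.Length; i++)
        {
            if (record[i] != separator)
            {
                output.Write(record.ToString());
                output.Write("\r\n"); // records are always terminated with CRLF on output
                return;
            }
        }
    }
}
```

Because the embedded newline never leaves the record buffer, there is no need to replace it and restore it afterwards.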

江湖正好 2024-07-21 23:29:37

I know I'm a little late to the game here, but I was having the same problem and my solution was a lot simpler than most given.

If you are able to determine the column count, which should be easy since the first line is usually the column titles, you can check each line's column count against the expected count. If a line has fewer columns than expected, you simply concatenate it with the lines that follow until the count matches. For example:

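// Assumes every field is quoted, so fields are separated by "," (quote, comma, quote),
// and that sr is an open StreamReader.
string sep = "\",\"";
int columnCount = 0;
int lineCount = 0;
string currentLine;
string lastLine = null;
string[] lineData;
while ((currentLine = sr.ReadLine()) != null)
{
    if (lineCount == 0)
    {
        // The first line holds the column titles, so it gives the expected column count.
        lineData = currentLine.Split(new string[] { sep }, StringSplitOptions.None);
        columnCount = lineData.Length;
        ++lineCount;
        continue;
    }
    string thisLine = lastLine + currentLine;

    lineData = thisLine.Split(new string[] { sep }, StringSplitOptions.None);
    if (lineData.Length < columnCount)
    {
        // Too few columns: the record continues on the next line.
        // ReadLine() dropped the terminator, so put the embedded newline back
        // (assumed here to be a lone '\n').
        lastLine = thisLine + "\n";
        continue;
    }
    else
    {
        lastLine = null;
    }
    ......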
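
A related variant, offered only as a sketch and not as the approach above: when not every field is quoted, counting columns by splitting on `","` is unreliable, but quote parity still works. A physical line ends a record only when the number of double quotes seen so far in that record is even. `QuotedRecordReader` and `ReadRecords` are hypothetical names, and the sketch assumes standard CSV quoting (embedded quotes doubled) and that embedded newlines are a lone '\n'.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class QuotedRecordReader
{
    // Joins physical lines into logical records by tracking whether we are
    // inside a quoted field. A doubled quote ("") toggles twice, so parity holds.
    public static IEnumerable<string> ReadRecords(TextReader reader)
    {
        var pending = new StringBuilder();
        bool insideQuotes = false;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            foreach (char ch in line)
            {
                if (ch == '"') insideQuotes = !insideQuotes;
            }

            pending.Append(line);
            if (insideQuotes)
            {
                // ReadLine() dropped the terminator; restore the embedded newline.
                pending.Append('\n');
            }
            else
            {
                yield return pending.ToString();
                pending.Clear();
            }
        }

        // Unterminated quoted field at end of input: return what we have.
        if (pending.Length > 0) yield return pending.ToString();
    }
}
```

Each returned record can then be checked for blankness and written out with `WriteLine()`, since the embedded newline is already back inside the record text.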
墟烟 2024-07-21 23:29:37

Thank you so much! With your code and some others, I came up with the following solution. I have added a link at the bottom to some code I wrote that uses some of the logic from this page. I figured I'd give credit where credit is due. Thanks!

Below is an explanation of what I needed:
I wrote this because I have some very large '|'-delimited files that have \r\n inside some of the columns, and I needed to use \r\n as the end-of-line delimiter. I was trying to import some files using SSIS packages, but because of some corrupted data in the files I was unable to. The file was over 5 GB, so it was too large to open and fix manually. I found the answer by looking through lots of forums to understand how streams work, and ended up with a solution that reads each character in a file and spits out the line based on the definitions I added to it. This is for use in a command-line application, complete with help :). I hope this helps some other people out. I haven't found a solution quite like it anywhere else, although the ideas were inspired by this forum and others.

https://stackoverflow.com/a/12640862/1582188
