Should I use a regular expression to split lines to handle unix/dos issues?

Posted 2024-12-21 17:53:36

I didn't feel like using XML for the input file of my T4 so I made this snippet that splits up a document into chunks separated by a blank line.

Am I appropriately making the carriage return optional here?

string s = @"Default
Default

CurrencyConversion
Details of currency conversions.

BudgetReportCache
Indicates whether the budget report is taken from query results or cache.";

string oneLine = @"[\r]\n";
string twoLines = @"[\r]\n[\r]\n";

var chunks = Regex.Split(s, twoLines, RegexOptions.Multiline);

var items = chunks.Select(c=>Regex.Split(c, oneLine, RegexOptions.Multiline)).ToDictionary(c=>c[0], c=>c[1]);

Note: I would never have thought of this, but since I started using Git I have seen it "say" things that reminded me of the unix2dos issues, which in turn made me think of Mono, and finally of whether I need to deal with portability (assuming the goal is perfection).

一世旳自豪 2024-12-28 17:53:36

Your regular expressions don't do what you think they do. Putting \r inside a set doesn't accomplish anything; the expression [\r]\n means the same thing as just \r\n.

You can make that work using the ? operator:

string oneLine = @"\r?\n";
string twoLines = @"\r?\n\r?\n";
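To make the difference concrete, here is a small sketch that counts how many pieces each pattern produces on LF-only and CRLF input. The class name `OptionalCrDemo`, the helper `CountParts`, and the sample strings are made up for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionalCrDemo
{
    // Counts the pieces produced by splitting text on the given pattern.
    public static int CountParts(string text, string pattern)
    {
        return Regex.Split(text, pattern).Length;
    }

    public static void Main()
    {
        string unixText = "a\nb\n\nc";       // LF only
        string dosText = "a\r\nb\r\n\r\nc";  // CRLF

        // [\r]\n is the same as \r\n, so LF-only text is never split.
        Console.WriteLine(CountParts(unixText, @"[\r]\n")); // 1
        Console.WriteLine(CountParts(dosText, @"[\r]\n"));  // 4

        // \r?\n makes the carriage return optional, so both inputs split alike.
        Console.WriteLine(CountParts(unixText, @"\r?\n"));  // 4
        Console.WriteLine(CountParts(dosText, @"\r?\n"));   // 4
    }
}
```

The empty string between two adjacent line breaks counts as a piece, which is why three breaks give four parts.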

However, I would suggest that you use the regular String.Split method instead of regular expressions:

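// Regular (non-verbatim) literals: String.Split needs real CR/LF characters,
// and @"\r\n" would be the four characters backslash, r, backslash, n.
string[] oneLine = { "\r\n", "\n" };
string[] twoLines = { "\r\n\r\n", "\n\n" };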

var chunks = s.Split(twoLines, StringSplitOptions.None);

var items =
  chunks.Select(c => c.Split(oneLine, StringSplitOptions.None))
  .ToDictionary(c => c[0], c => c[1]);

七度光 2024-12-28 17:53:36

Yes, you should allow for different line separators, but that's not how you do it. The square brackets don't make their contents optional, and you aren't taking the old Mac-style \r into account. I'd use these regexes:

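string oneLine = @"\r\n|[\r\n]";
// Atomic group (?>...): with a plain (?:...) group, {2} can backtrack and match a single \r\n as \r followed by \n.
string twoLines = @"(?>\r\n|[\r\n]){2}";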

That's "carriage-return + linefeed OR carriage-return OR linefeed".

Also, you don't need the Multiline option. It only changes the meaning of the ^ and $ anchors, which you aren't using (and don't need to use).
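As a rough sketch of how such patterns could be plugged into the question's snippet, the following parses a string that mixes all three line-ending styles. The class name `ChunkParser`, the method `Parse`, and the sample text are assumptions made for this example. The two-line pattern is written with an atomic group, `(?>...)`, so that a lone `\r\n` inside a chunk cannot be matched as `\r` followed by `\n`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ChunkParser
{
    // One line break: CRLF, or a lone CR, or a lone LF.
    private const string OneLine = @"\r\n|[\r\n]";
    // Two consecutive line breaks; the atomic group keeps CRLF together.
    private const string TwoLines = @"(?>\r\n|[\r\n]){2}";

    // Splits text into blank-line-separated chunks and maps each chunk's
    // first line to its second line. Assumes every chunk has two lines
    // and that first lines are unique.
    public static Dictionary<string, string> Parse(string text)
    {
        return Regex.Split(text, TwoLines)
            .Select(chunk => Regex.Split(chunk, OneLine))
            .ToDictionary(lines => lines[0], lines => lines[1]);
    }

    public static void Main()
    {
        // Hypothetical input: one chunk per line-ending style.
        string mixed =
            "Default\r\nDefault\r\n\r\n" +
            "CurrencyConversion\nDetails\n\n" +
            "BudgetReportCache\rCached";

        foreach (var item in Parse(mixed))
            Console.WriteLine(item.Key + " => " + item.Value);
    }
}
```

No `RegexOptions.Multiline` is passed, since neither pattern uses the `^` or `$` anchors.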

千柳 2024-12-28 17:53:36

If you want to go whole hog on portability (and yes, I'm only adding this answer in response to Alan's mention of old Mac-style \r), then you want to cover:

*nix style: \n

DOS/Windows style: \r\n

Old Mac style: \r

EBCDIC style: \u0085 (probably in slightly more current-day use than old Mac style, I'd guess).

Line-separator formatting character: \u2028

Paragraph-separator formatting character: \u2029

Let's just not dwell on the precise semantics of \u000B and \u000C, and turn this into something sensible (eventually). If we were to try to deal with all of those, how would we do it?

With 6 different line-breaks, one of which is a combination of two of the others but should not be treated as two line-breaks, dealing with this in the regex itself could be nasty.

Much better would be to filter them all out in a TextReader wrapper:

using System.IO;

public class LineBreakNormaliser : TextReader
{
  private readonly TextReader _source;
  //True for every character this reader treats as a line break.
  private bool isNewLine(int charAsInt)
  {
    switch(charAsInt)
    {
      case '\n': case '\r':
      case '\u0085': case '\u2028': case '\u2029':
      case '\u000B': case '\u000C':
        return true;
      default:
        return false;
    }
  }
  public LineBreakNormaliser(TextReader source)
  {
    _source = source;
  }
  public override void Close()
  {
    _source.Close();
    base.Close();
  }
  protected override void Dispose(bool disposing)
  {
    if(disposing)
      _source.Dispose();
    base.Dispose(disposing);
  }
  public override int Peek()
  {
    int i = _source.Peek();
    if(i == -1)
      return -1;
    if(isNewLine(i))
      return '\n';
    return i;
  }
  public override int Read()
  {
    int i = _source.Read();
    if(i == -1)
      return -1;
    if(i == '\r')
    {
      if(_source.Peek() == '\n')
        _source.Read(); //eat second half of CRLF pair.
      return '\n'; //CR and CRLF both become a single LF.
    }
    if(isNewLine(i))
      return '\n';
    return i;
  }
  public override int Read(char[] buffer, int index, int count)
  {
    //We take advantage of the fact that we are allowed to return fewer than requested.
    //ReadBlock does the work for us for those who need the full amount:
    char[] tmpBuffer = new char[count];
    int cChars = count = _source.Read(tmpBuffer, 0, count);
    if(cChars == 0)
      return 0;
    for(int i = 0; i != cChars; ++i)
    {
      char cur = tmpBuffer[i];
      if(cur == '\r')
      {
        if(i == cChars - 1)
        {
          //The LF of a CRLF pair may still be in the source. It was never
          //in tmpBuffer, so eating it does not change count.
          if(_source.Peek() == '\n')
            _source.Read();
        }
        else if(tmpBuffer[i + 1] == '\n')
        {
          ++i; //skip the LF of a CRLF pair inside the buffer
          --count; //two input chars produced one output char
        }
        buffer[index++] = '\n';
      }
      else if(isNewLine(cur))
        buffer[index++] = '\n';
      else
        buffer[index++] = cur; //ordinary characters pass through unchanged
    }
    return count;
  }
}

If you read the file via this text reader, then from this point on your regex can depend on the only newline being \n, and so can any other code.
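As an illustration of how simple the downstream code can get once \n is the only line break, here is a self-contained sketch. To keep it runnable on its own it does not use the reader above; instead it assumes a hypothetical in-memory helper, `Normalise`, built on `Regex.Replace`, which is only practical when the whole text already fits in a string. The class and method names are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class NormalisedParsing
{
    // Hypothetical in-memory stand-in for a normalising reader: maps every
    // line break listed above to a single \n. CRLF is tried first so the
    // pair counts as one break.
    public static string Normalise(string text)
    {
        return Regex.Replace(text, "\r\n|[\r\u0085\u2028\u2029\u000B\u000C]", "\n");
    }

    // With normalised input, plain string splitting on \n is enough.
    // Assumes every chunk has two lines and that first lines are unique.
    public static Dictionary<string, string> Parse(string normalised)
    {
        return normalised.Split(new[] { "\n\n" }, StringSplitOptions.None)
            .Select(chunk => chunk.Split('\n'))
            .ToDictionary(lines => lines[0], lines => lines[1]);
    }

    public static void Main()
    {
        // Hypothetical input mixing CRLF and \u0085 line breaks.
        string raw = "Default\r\nDefault\r\n\r\nCurrencyConversion\u0085Details";
        foreach (var item in Parse(Normalise(raw)))
            Console.WriteLine(item.Key + " => " + item.Value);
    }
}
```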

This done, the regex can actually be simpler than ever. While it's total overkill for this single case (and only written because, after Alan's mention of OS 9 and earlier, the idea of supporting IBM EBCDIC machines amused me), it is reusable for all other cases, in which context it's not overkill at all, because it becomes "just use the well-tested line-normaliser to make things simpler". (Once it is well-tested, that is; I haven't tested any of the above.)
