TStringList splitting bugs

Posted on 2024-11-17 01:43:16

Recently a reputable SO user informed me that TStringList has splitting bugs that cause it to fail when parsing CSV data. I wasn't told what these bugs are, and an internet search, including Quality Central, turned up nothing, so I'm asking: what are the TStringList splitting bugs?

Please note that I'm not interested in unfounded, opinion-based answers.

What I know:

Not much... One is that these bugs show up rarely with test data, but not so rarely in the real world.

The other is that, as stated, they prevent proper parsing of CSV. Since the bugs seem difficult to reproduce with test data, I am (probably) seeking help from those who have tried using a string list as a CSV parser in production code.

Irrelevant problems:

I obtained the information on a question tagged 'Delphi-XE', so parsing failures caused by the "space character being considered as a delimiter" feature do not apply: the StrictDelimiter property, introduced with Delphi 2006, resolved that. I myself am using Delphi 2007.
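A minimal sketch of the StrictDelimiter difference mentioned above. The helper name SplitCount and the sample string are made up for illustration; the sketch assumes a Delphi version that has StrictDelimiter (Delphi 2006 or later, per the paragraph above). With StrictDelimiter off, the space inside a field also acts as a separator; with it on, only Delimiter does.

```pascal
program StrictDelimiterDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: number of items DelimitedText produces for S.
function SplitCount(const S: string; Strict: Boolean): Integer;
var
  SL: TStringList;
begin
  SL := TStringList.Create;
  try
    SL.Delimiter := ',';
    SL.QuoteChar := '"';
    SL.StrictDelimiter := Strict; // must be set before assigning DelimitedText
    SL.DelimitedText := S;
    Result := SL.Count;
  finally
    SL.Free;
  end;
end;

begin
  // Non-strict: the space in "Ford Focus" splits the field as well.
  Writeln(SplitCount('1997,Ford Focus,E350', False));
  // Strict: only the comma separates fields.
  Writeln(SplitCount('1997,Ford Focus,E350', True));
end.
```

On the versions this sketch assumes, the first call should report 4 items and the second 3.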

Also, since a string list can only hold strings, it is only responsible for splitting fields. Any conversion difficulties involving field values (e.g. dates, floating-point numbers) that arise from locale differences and the like are out of scope.

Basic rules:

There's no standard specification for CSV, but there are basic rules inferred from various specifications.

Below is a demonstration of how TStringList handles these. The rules and example strings are from Wikipedia. The test code adds brackets ([ ]) around strings so that leading or trailing spaces are visible where relevant.

Spaces are considered part of a field and should not be ignored.

Test string: [1997, Ford , E350]
Items: [1997] [ Ford ] [ E350]

Fields with embedded commas must be enclosed within double-quote characters.

Test string: [1997,Ford,E350,"Super, luxurious truck"]
Items: [1997] [Ford] [E350] [Super, luxurious truck]

Fields with embedded double-quote characters must be enclosed within double-quote characters, and each of the embedded double-quote characters must be represented by a pair of double-quote characters.

Test string: [1997,Ford,E350,"Super, ""luxurious"" truck"]
Items: [1997] [Ford] [E350] [Super, "luxurious" truck]

Fields with embedded line breaks must be enclosed within double-quote characters.

Test string: [1997,Ford,E350,"Go get one now
they are going fast"]
Items: [1997] [Ford] [E350] [Go get one now
they are going fast]

In CSV implementations that trim leading or trailing spaces, fields with such spaces must be enclosed within double-quote characters.

Test string: [1997,Ford,E350," Super luxurious truck "]
Items: [1997] [Ford] [E350] [ Super luxurious truck ]

Fields may always be enclosed within double-quote characters, whether necessary or not.

Test string: ["1997","Ford","E350"]
Items: [1997] [Ford] [E350]

Testing code:

var
  SL: TStringList;
  rule: string;

  // Returns every item wrapped in brackets so that spaces stay visible.
  function GetItemsText: string;
  var
    i: Integer;
  begin
    Result := '';  // a string Result is not guaranteed to start out empty
    for i := 0 to SL.Count - 1 do
      Result := Result + '[' + SL[i] + '] ';
  end;

  procedure Test(const TestStr: string);
  begin
    SL.DelimitedText := TestStr;
    Writeln(rule + sLineBreak, 'Test string: [', TestStr + ']' + sLineBreak,
            'Items: ' + GetItemsText + sLineBreak);
  end;

begin
  SL := TStringList.Create;
  try
    SL.Delimiter := ',';        // default, but ";" is used with some locales
    SL.QuoteChar := '"';        // default
    SL.StrictDelimiter := True; // required: strings are separated *only* by Delimiter

    rule := 'Spaces are considered part of a field and should not be ignored.';
    Test('1997, Ford , E350');

    rule := 'Fields with embedded commas must be enclosed within double-quote characters.';
    Test('1997,Ford,E350,"Super, luxurious truck"');

    rule := 'Fields with embedded double-quote characters must be enclosed within double-quote characters, and each of the embedded double-quote characters must be represented by a pair of double-quote characters.';
    Test('1997,Ford,E350,"Super, ""luxurious"" truck"');

    rule := 'Fields with embedded line breaks must be enclosed within double-quote characters.';
    // CR+LF is #13#10; the original had the two characters swapped
    Test('1997,Ford,E350,"Go get one now'#13#10'they are going fast"');

    rule := 'In CSV implementations that trim leading or trailing spaces, fields with such spaces must be enclosed within double-quote characters.';
    Test('1997,Ford,E350," Super luxurious truck "');

    rule := 'Fields may always be enclosed within double-quote characters, whether necessary or not.';
    Test('"1997","Ford","E350"');
  finally
    SL.Free;
  end;
end;

If you've read this far, the question was :) what are the "TStringList splitting bugs"?

Comments (4)

夏有森光若流苏 2024-11-24 01:43:16

Not much... One is that these bugs show up rarely with test data, but not so rarely in the real world.

All it takes is one case. Test data is not random data; a user with a single failure case should submit the data and voilà, we've got a test case. If no one can provide test data, maybe there's no bug/failure?

There's no standard specification for CSV.

That one sure helps with the confusion. Without a standard specification, how do you prove something is wrong? If this is left to one's own intuition, you can get into all kinds of trouble. Here are some examples from my own happy interaction with government-issued software: my application was supposed to export data in CSV format, and the government application was supposed to import it. Here's what got us into a lot of trouble several years in a row:

  • How do you represent empty data? Since there's no CSV standard, one year my friendly gov decided anything goes, including nothing (two consecutive commas). Next they decided only consecutive commas are OK; that is, Field,"",Field is not valid and should be Field,,Field. I had a lot of fun explaining to my customers that the gov app changed validation rules from one week to the next...
  • Do you export ZERO integer data? This was probably a bigger abuse, but my "gov app" decided to validate that also. At one time it was mandatory to include the 0, then it was mandatory NOT to include the 0. That is, at one time Field,0,Field was valid; next, Field,,Field was the only valid way...

And here's another test case where (my) intuition failed:

1997, Ford, E350, "Super, luxurious truck"

Please note the space between , and "Super, and the very lucky comma that follows "Super. The parser employed by TStrings only recognizes the quote character if it immediately follows the delimiter. That string is parsed as:

[1997]
[ Ford]
[ E350]
[ "Super]
[ luxurious truck"]

Intuitively I'd expect:

[1997]
[ Ford]
[ E350]
[Super, luxurious truck]

But guess what, Excel does it the same way Delphi does it...

Conclusion

  • TStrings.CommaText is fairly good and nicely implemented; at least the Delphi 2010 version I looked at is quite efficient (it avoids multiple string allocations and uses a PChar to "walk" the parsed string) and works about the same as Excel's parser does.
  • In the real world you'll need to exchange data with other software, written using other libraries (or no libraries at all), where people might have misinterpreted some of the (missing?) rules of CSV. You'll have to adapt, and it'll probably not be a case of right or wrong but a case of "my clients need to import this crap". If that happens, you'll have to write your own parser, one that adapts to the requirements of the third-party app you're dealing with. Until that happens, you can safely use TStrings. And when it does happen, it might not be TStrings' fault!
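As an illustration of the "write your own parser" point, here is a rough sketch of a splitter that tolerates spaces between a delimiter and an opening quote, which is the case where TStrings and intuition diverged above. ParseLenientCsvLine and BracketedFields are hypothetical names, the comma delimiter and double-quote character are hard-coded, and this is not a complete CSV parser (it ignores text after a closing quote and does not handle multi-line records).

```pascal
program LenientCsvDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical lenient splitter: unlike TStrings.DelimitedText, it accepts
// spaces between a delimiter and an opening quote.
procedure ParseLenientCsvLine(const Line: string; Fields: TStrings);
var
  i, n, start: Integer;
  field: string;
begin
  Fields.Clear;
  n := Length(Line);
  i := 1;
  while i <= n + 1 do  // n + 1 so that a trailing comma yields an empty field
  begin
    start := i;
    // Look past leading spaces to see whether the field is quoted.
    while (i <= n) and (Line[i] = ' ') do
      Inc(i);
    if (i <= n) and (Line[i] = '"') then
    begin
      Inc(i); // skip the opening quote
      field := '';
      while i <= n do
        if Line[i] <> '"' then
        begin
          field := field + Line[i];
          Inc(i);
        end
        else if (i < n) and (Line[i + 1] = '"') then
        begin
          field := field + '"'; // doubled quote -> one literal quote
          Inc(i, 2);
        end
        else
        begin
          Inc(i); // closing quote
          Break;
        end;
      // Ignore anything between the closing quote and the next delimiter.
      while (i <= n) and (Line[i] <> ',') do
        Inc(i);
    end
    else
    begin
      // Unquoted field: keep it verbatim, leading spaces included.
      i := start;
      while (i <= n) and (Line[i] <> ',') do
        Inc(i);
      field := Copy(Line, start, i - start);
    end;
    Fields.Add(field);
    Inc(i); // step over the delimiter (or past the end of the line)
  end;
end;

// Hypothetical helper: parsed fields, each wrapped in brackets.
function BracketedFields(const Line: string): string;
var
  Fields: TStringList;
  k: Integer;
begin
  Result := '';
  Fields := TStringList.Create;
  try
    ParseLenientCsvLine(Line, Fields);
    for k := 0 to Fields.Count - 1 do
      Result := Result + '[' + Fields[k] + ']';
  finally
    Fields.Free;
  end;
end;

begin
  Writeln(BracketedFields('1997, Ford, E350, "Super, luxurious truck"'));
end.
```

Whether dropping the spaces in front of a quoted field is the right call depends entirely on what the other application expects, which is the point of the conclusion above.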

不忘初心 2024-11-24 01:43:16

I'm going to go out on a limb and say that the most common failure case is the embedded line break. I know most of the CSV parsing I do ignores that. I use two TStringLists, one for the file I'm parsing and the other for the current line, so I end up with code similar to the following:

procedure Foo;
var
    CSVFile, ALine: TStringList;
    s: string;
begin
    ALine := nil;
    CSVFile := TStringList.Create;
    try
        ALine := TStringList.Create;
        ALine.StrictDelimiter := True;
        CSVFile.LoadFromFile('C:\Path\To\File.csv');
        for s in CSVFile do begin
            ALine.CommaText := s;  // one physical line at a time
            DoSomethingInteresting(ALine);
        end;
    finally
        // free both lists even if loading or processing raises
        ALine.Free;
        CSVFile.Free;
    end;
end;

Of course, since I'm not taking care to make sure that each line is "complete", I can run into cases where the input contains a quoted line break in a field and I miss it.

Until I run into real world data where it's an issue, I'm not going to bother fixing it. :-P
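If that case ever does turn up, one possible approach (a sketch only, not something the answer above does) is to merge physical lines into logical records before splitting them: keep appending lines while the number of double quotes seen so far is odd. HasOpenQuote, JoinCsvRecords and CountLogicalRecords are hypothetical names, and the sketch assumes the double quote is the only quote character.

```pascal
program JoinCsvRecordsDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// True when S contains an odd number of double quotes, i.e. a quoted field
// is still open at the end of S. A doubled quote ("") flips the flag twice,
// so it does not change the result.
function HasOpenQuote(const S: string): Boolean;
var
  i: Integer;
begin
  Result := False;
  for i := 1 to Length(S) do
    if S[i] = '"' then
      Result := not Result;
end;

// Hypothetical helper: merges physical lines into logical CSV records.
procedure JoinCsvRecords(Lines, Records: TStrings);
var
  i: Integer;
  pending: string;
begin
  Records.Clear;
  pending := '';
  for i := 0 to Lines.Count - 1 do
  begin
    if pending = '' then
      pending := Lines[i]
    else
      pending := pending + sLineBreak + Lines[i]; // restore the line break
    if not HasOpenQuote(pending) then
    begin
      Records.Add(pending);
      pending := '';
    end;
  end;
  if pending <> '' then
    Records.Add(pending); // unterminated quote: pass the rest through as is
end;

// Hypothetical helper: number of logical records in a block of CSV text.
function CountLogicalRecords(const FileText: string): Integer;
var
  Lines, Records: TStringList;
begin
  Records := nil;
  Lines := TStringList.Create;
  try
    Records := TStringList.Create;
    Lines.Text := FileText; // splits on line breaks, like LoadFromFile
    JoinCsvRecords(Lines, Records);
    Result := Records.Count;
  finally
    Records.Free;
    Lines.Free;
  end;
end;

begin
  // Three physical lines, but the first two form a single record.
  Writeln(CountLogicalRecords(
    '1997,Ford,E350,"Go get one now' + sLineBreak +
    'they are going fast"' + sLineBreak +
    '2000,Mercury,Cougar'));
end.
```

Each entry in Records could then be assigned to ALine.CommaText in place of the raw line in the loop above.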

嗼ふ静 2024-11-24 01:43:16

Another example... this TStringList.CommaText bug exists in Delphi 2009.

procedure TForm1.Button1Click(Sender: TObject);
var
  list : TStringList;
begin
  list := TStringList.Create();
  try
    list.CommaText := '"a""';
    Assert(list.Count = 1);
    Assert(list[0] = 'a');
    Assert(list.CommaText = 'a'); // FAILS -- actual value is "a""
  finally
    FreeAndNil(list);
  end;
end;

The TStringList.CommaText setter and related methods corrupt the memory of the string that holds the a item (its null terminator character is overwritten by a ").

罪#恶を代价 2024-11-24 01:43:16

Have you already tried splitting into a TArray<String>?

var
  text: String;
  arr: TArray<String>;
begin
  text := '1997,Ford,E350';
  arr := text.Split([',']);  // string helper method from SysUtils
end;

So arr would be:

arr[0] = 1997;
arr[1] = Ford;
arr[2] = E350;
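One caveat worth checking against the rules listed in the question: as far as I know, this overload of Split cuts at every comma and does not treat double quotes specially, and the string helper only exists in Delphi versions newer than the Delphi 2007 the question mentions. A small sketch (PlainSplitCount is a made-up helper name):

```pascal
program SplitQuotesDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: number of pieces a plain comma split produces.
function PlainSplitCount(const S: string): Integer;
begin
  Result := Length(S.Split([',']));
end;

begin
  // Simple case from the answer above: three fields, three pieces.
  Writeln(PlainSplitCount('1997,Ford,E350'));
  // Quoted field with an embedded comma: the quotes are not honored,
  // so the last field is cut in two and its quotes stay in the pieces.
  Writeln(PlainSplitCount('1997,Ford,E350,"Super, luxurious truck"'));
end.
```

So for data that follows the quoting rules from the question, a plain split is only safe when no field contains the delimiter.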