Delphi XE - RawByteString vs AnsiString

Posted 2024-11-09 04:36:04

I had a similar question to this here: Delphi XE - should I use String or AnsiString? After deciding that it is right to use ANSI strings in a (large) library of mine, I have realized that I can actually use RawByteString instead of AnsiString. Because I mix Unicode strings with ANSI strings, my code now has quite a few places where it converts between them. However, it looks like I could get rid of those conversions by using RawByteString.

Please let me know your opinion about it.
Thanks.


Update:
This seems to be disappointing. It looks like the compiler still makes a conversion from RawByteString to string.

procedure TForm1.FormCreate(Sender: TObject);
var x1, x2: RawByteString;
    s: string;
begin
  x1:= 'a';
  x2:= 'b';
  x1:= x1+ x2;
  s:= x1;              {      <------- Implicit string cast from 'RawByteString' to 'string'     }
end;

I think it does some internal work (such as copying data), so my code will not be much faster, and I will still have to add lots of typecasts to my code in order to silence the compiler.

Comments (2)

影子的影子 2024-11-16 04:36:04

RawByteString is an AnsiString with no code page set by default.

When you assign another AnsiString to a RawByteString variable, it takes on the code page of the source string. Going to or from a UnicodeString (the default string type in Delphi XE) will still involve a conversion, though. Sorry.

But there is another use of RawByteString, which is to store plain byte content (e.g. the content of a database BLOB field, just like an array of byte).

To summarize:

  • RawByteString should be used as a "code page agnostic" parameter to a method or function;
  • RawByteString can be used as a variable type to store some BLOB data.
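
A minimal sketch of the "code page agnostic" parameter idea, assuming Delphi 2009 or later. `CountByte` is a made-up helper for illustration, not part of any library; `StringCodePage` is the System routine that reports the code page stored in the string header.

```pascal
program RawByteParamDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: counts the bytes equal to B, whatever the code page.
// Because the parameter is a RawByteString, an AnsiString of any code page
// is passed as it is, without a code page conversion at the call site.
function CountByte(const S: RawByteString; B: Byte): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if Ord(S[i]) = B then
      Inc(Result);
end;

var
  W: string;
  U: UTF8String;
  A: AnsiString;
begin
  W := 'h'#$00E9'llo';      // "hello" with an accented e, as UTF-16
  U := UTF8String(W);       // explicit UTF-16 -> UTF-8 conversion
  A := 'hello';             // system ANSI code page

  // The code page travels with each string into the RawByteString parameter.
  Writeln('Code page of U: ', StringCodePage(U));
  Writeln('Code page of A: ', StringCodePage(A));

  // Same function, two different code pages, no conversion in between.
  Writeln(CountByte(U, Ord('l')));
  Writeln(CountByte(A, Ord('l')));
end.
```

The helper only looks at raw bytes, so it is safe for any code page; anything that interprets characters would still need to know which encoding it was given.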

If you want to reduce conversions, and would rather use 8-bit char strings in your application, you had better:

  • Not use the generic AnsiString type, which depends on the current system code page, and through which you'll lose data;
  • Rely on UTF-8 encoding, i.e. an 8-bit code page / charset which won't lose any data when converted from or to a UnicodeString;
  • Not let the compiler show warnings about implicit conversions: all conversions should be made explicit;
  • Use your own dedicated set of functions to handle your UTF-8 content.
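
As an illustration of the "make every conversion explicit" point, here is a small sketch assuming Delphi 2009 or later. It uses the stock `UTF8String` type and the System routine `UTF8ToString` rather than any custom type, and the two `{$WARN ...}` lines are one possible way to turn the implicit-cast warnings into errors so that nothing slips through unnoticed.

```pascal
program ExplicitUtf8Demo;

{$APPTYPE CONSOLE}

// Assumption: promote the implicit string cast warnings to errors,
// so every UTF-16 <-> 8-bit conversion has to be written out by hand.
{$WARN IMPLICIT_STRING_CAST ERROR}
{$WARN IMPLICIT_STRING_CAST_LOSS ERROR}

uses
  SysUtils;

var
  S: string;       // UnicodeString (UTF-16) in Delphi 2009 and later
  U: UTF8String;   // AnsiString tagged with the UTF-8 code page
begin
  S := 'caf'#$00E9;        // four UTF-16 code units

  U := UTF8String(S);      // explicit UTF-16 -> UTF-8, no data loss
  Writeln(Length(U));      // 5: the accented character takes two bytes

  S := UTF8ToString(U);    // explicit UTF-8 -> UTF-16
  Writeln(Length(S));      // 4 again after the round trip
end.
```

The conversions still cost something; the gain is that each one is visible in the source, so you can see where to keep data in UTF-8 instead of bouncing through UnicodeString.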

That is exactly what we did for our framework. We wanted to use UTF-8 in its kernel because:

  • We rely on UTF-8 encoded JSON for data transmission;
  • Memory consumption will be smaller;
  • The used SQLite3 engine will store text as UTF-8 in its database file;
  • We wanted a way of handling Unicode text with no loss of data in all versions of Delphi (from Delphi 6 up to XE), and WideString was not an option because it's dead slow and has the same problem of implicit conversions.

But, in order to achieve the best speed, we wrote some optimized functions to handle our custom string type:

  {{ RawUTF8 is an UTF-8 String stored in an AnsiString
    - use this type instead of System.UTF8String, which behavior changed
     between Delphi 2009 compiler and previous versions: our implementation
     is consistent and compatible with all versions of Delphi compiler
    - mimic Delphi 2009 UTF8String, without the charset conversion overhead
    - all conversion to/from AnsiString or RawUnicode must be explicit }
{$ifdef UNICODE} RawUTF8 = type AnsiString(CP_UTF8); // Codepage for an UTF8string
{$else}          RawUTF8 = type AnsiString; {$endif}

/// our fast RawUTF8 version of Trim(), for Unicode only compiler
// - this Trim() is seldom used, but this RawUTF8 specific version is needed
// by Delphi 2009/2010/XE, to avoid two unnecessary conversions into UnicodeString
function Trim(const S: RawUTF8): RawUTF8;

/// our fast RawUTF8 version of Pos(), for Unicode only compiler
// - this Pos() is seldom used, but this RawUTF8 specific version is needed
// by Delphi 2009/2010/XE, to avoid two unnecessary conversions into UnicodeString
function Pos(const substr, str: RawUTF8): Integer; overload; inline;

And we reserved the RawByteString type for handling BLOB data:

{$ifndef UNICODE}
  /// define RawByteString, as it does exist in Delphi 2009/2010/XE
  // - to be used for byte storage into an AnsiString
  // - use this type if you don't want the Delphi compiler to do any
  // code page conversions when you assign a typed AnsiString to a RawByteString,
  // i.e. a RawUTF8 or a WinAnsiString
  RawByteString = AnsiString;
  /// pointer to a RawByteString
  PRawByteString = ^RawByteString;
{$endif}

/// create a File from a string content
// - uses RawByteString for byte storage, whatever the codepage is
function FileFromString(const Content: RawByteString; const FileName: TFileName;
  FlushOnDisk: boolean=false): boolean;

Source code is available in our repository. In this unit, the UTF-8 related functions were deeply optimized, with both Pascal and asm versions for better speed. We sometimes overloaded default functions (like Pos) to avoid conversions. More information about how we handled text in the framework is available here.

Last word:

If you are sure that you will only have 7-bit content in your application (no accented characters), you may use the default AnsiString type in your program. But in this case, you had better add the AnsiStrings unit to your uses clause, so that you get overloaded string functions which avoid most unwanted conversions.
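
A short sketch of what that looks like, assuming Delphi 2009 to XE, where the unit is simply named `AnsiStrings` (later versions use the scoped name `System.AnsiStrings`). The calls are qualified with the unit name here only to make it obvious which overload is meant.

```pascal
program AnsiStringsDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, AnsiStrings;

var
  A: AnsiString;
begin
  A := '  hello  ';

  // AnsiString overloads: the value is not widened to UnicodeString
  // and narrowed back again on each call.
  A := AnsiStrings.Trim(A);
  A := AnsiStrings.UpperCase(A);

  Writeln(A);
end.
```

Without the AnsiStrings unit, the same calls would resolve to the SysUtils versions that take a UnicodeString, which is where the hidden conversions and warnings come from.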

抠脚大汉 2024-11-16 04:36:04

RawByteString is still an "AnsiString." It is best described as a "universal receiver" which means it will take on whatever the source-string's codepage is at the point of assignment without forcing a codepage conversion. RawByteString was intended to be used only as a function parameter so that you will, as you've discovered, not incur a conversion between AnsiStrings with differing code-page affinities when calling utility functions which take AnsiStrings.

However, in the case above, you're assigning what is essentially an AnsiString to a UnicodeString, which incurs a conversion. The conversion is unavoidable because the RawByteString has a payload of 8-bit characters, whereas a string (UnicodeString) has a payload of 16-bit characters.
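
To make the point concrete, here is a small variation on the code from the question, assuming Delphi 2009 or later. The explicit `string(...)` cast silences the implicit cast warning, but the conversion itself still takes place; `StringElementSize` (a System routine) shows why it has to.

```pascal
program RawByteToStringDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  x1, x2: RawByteString;
  s: string;
begin
  x1 := 'a';
  x2 := 'b';
  x1 := x1 + x2;

  // Explicit cast: no compiler warning, but the 8-bit payload is still
  // converted into a newly allocated 16-bit string.
  s := string(x1);

  Writeln(StringElementSize(x1));  // bytes per character of the RawByteString
  Writeln(StringElementSize(s));   // bytes per character of the UnicodeString
end.
```

So the typecast is about documenting intent, not about speed. The only way to avoid the cost is to stay on one side (all 8-bit or all UnicodeString) for as long as possible.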
