Using strcmp, strcpy, and strlen in a templated string class

Posted 2024-12-23 14:05:13

I overheard a discussion some time ago about how, when creating a templated string class that can hold UTF8 and UTF16, you should not use strcmp, strcpy and strlen. From what I recall, you are supposed to use functions from the <algorithm> header instead; however, I do not remember what the implementation looks like or why this is so. Could someone please explain which functions to use instead, how to use them, and why?

The example of the templated string class would be something such as

String<UTF8> utf8String;
String<UTF16> utf16String;

Here UTF8 is an unsigned char and UTF16 is an unsigned short.

一百个冬季 2024-12-30 14:05:13

First off, C++ has no need of additional string classes. There are probably already hundreds or thousands too many string classes that have been developed, and yours won't improve the situation. Unless you're doing this purely for your edification, you should think long and hard and then decide not to write a new one.

You can use std::basic_string<char> to hold UTF-8 code unit sequences, std::basic_string<char16_t> to hold UTF-16 code unit sequences, std::basic_string<char32_t> to hold UTF-32 code unit sequences, etc. C++ even offers short, handy names for these types: string, u16string, and u32string. basic_string already solves the problem you're asking about here by offering member functions for copying, comparing, and getting the length of the string that work for any code unit you template it with.
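
As a rough illustration of that point, here is a minimal sketch using std::u16string. The helper same_code_units and the function basic_string_demo are names made up for this example; only the basic_string members (size, compare, copy construction) come from the standard library.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true when two UTF-16 strings hold the same code units.
bool same_code_units(const std::u16string& a, const std::u16string& b) {
    return a.compare(b) == 0;  // compares char16_t code units, not bytes
}

// Illustrative self-check; call it from main() to run it.
void basic_string_demo() {
    std::u16string hello = u"Hello";     // five UTF-16 code units
    std::u16string copy = hello;         // copying needs no strcpy
    assert(hello.size() == 5);           // length in code units, no strlen
    assert(same_code_units(hello, copy));
    assert(hello.compare(u"Help") < 0);  // 'l' (0x6c) sorts before 'p' (0x70)
}
```

Note that size() counts code units, not user-perceived characters; that distinction is outside what this sketch covers.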

I can't think of any good reason for new code that's not interfacing with legacy code to use anything else as its canonical storage type for strings. Even if you do interface with legacy code that uses something else, if the surface area of that interface isn't large you should probably still use one of the standard types and not anything else, and of course if you're interfacing with legacy code you'll be using that legacy type anyway, not writing your own new type.

With that said, the reason you can't use strcmp, strcpy, and strlen for your templated string type is that they all operate on null-terminated byte sequences. If your code unit is larger than one byte, then there may be bytes that are zero before the actual terminating null code unit (assuming you use null termination at all, which you probably shouldn't). Consider the bytes of this UTF-16 representation of the string "Hello" (on a little-endian machine).

48 00 65 00 6c 00 6c 00 6f 00

Since UTF-16 uses 16-bit code units, the character 'H' ends up stored as the two bytes 48 00. A function that walks the above sequence of bytes and treats the first null byte as the end would take the second half of the first character as the end of the whole string. This clearly will not work.
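
To make the failure concrete, here is a small sketch that hands those exact bytes to strlen. The bytes are written out explicitly (with a two-byte null code unit appended) so the result does not depend on the endianness of the machine running it. The names hello_utf16le, byte_strlen_of_hello and strlen_demo are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// The UTF-16 little-endian bytes of "Hello" from the text, followed by a
// two-byte null code unit.
const char hello_utf16le[] = {0x48, 0x00, 0x65, 0x00, 0x6c, 0x00,
                              0x6c, 0x00, 0x6f, 0x00, 0x00, 0x00};

// Hypothetical helper: what strlen reports for that buffer.
std::size_t byte_strlen_of_hello() {
    return std::strlen(hello_utf16le);  // stops at the second byte of 'H'
}

// Illustrative self-check; call it from main() to run it.
void strlen_demo() {
    assert(byte_strlen_of_hello() == 1);  // one byte, not five code units
}
```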

So, strcmp, strcpy, and strlen are all specialized versions of algorithms that can be implemented more generally. Since they only work with byte sequences, and you need to work with code unit sequences where the code unit may be larger than a byte, you need generic algorithms that can work with any code unit. The standard library offers lots of generic algorithms. Here are my suggestions for replacing these str* functions.

strcmp compares two sequences of code units and returns 0 if the two sequences are equal, a negative value if the first is lexicographically less than the second, and a positive value otherwise. The standard library contains the generic algorithm lexicographical_compare, which does nearly the same thing, except that it returns true if the first sequence is lexicographically less than the second and false otherwise, so a single call cannot tell "equal" apart from "greater".
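
If you want a strcmp-style three-way result, one possible approach is to call lexicographical_compare in both directions. The sketch below assumes that is acceptable (it may walk the sequences twice); compare_code_units and compare_demo are hypothetical names, not standard functions. Elements are compared with operator< on the code unit type, so signed code unit types will order differently from strcmp, which compares bytes as unsigned char.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical strcmp-like wrapper over iterator ranges: returns a negative
// value, zero, or a positive value.
template <typename It>
int compare_code_units(It first1, It last1, It first2, It last2) {
    if (std::lexicographical_compare(first1, last1, first2, last2)) return -1;
    if (std::lexicographical_compare(first2, last2, first1, last1)) return 1;
    return 0;  // neither range is less than the other, so they are equal
}

// Illustrative self-check; call it from main() to run it.
void compare_demo() {
    const std::u16string a = u"Hello", b = u"Help";
    assert(compare_code_units(a.begin(), a.end(), b.begin(), b.end()) < 0);
    assert(compare_code_units(b.begin(), b.end(), a.begin(), a.end()) > 0);
    assert(compare_code_units(a.begin(), a.end(), a.begin(), a.end()) == 0);
}
```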

strcpy copies a sequence of code units. You can use the standard library's copy algorithm instead.
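
A minimal sketch of that replacement, assuming the caller already knows how many code units to copy. The function copy_code_units and its use of std::vector as the destination are choices made for this example, not something the standard prescribes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical strcpy replacement: copies `count` code units starting at
// `src` into a buffer sized to hold exactly that many.
template <typename CodeUnit>
std::vector<CodeUnit> copy_code_units(const CodeUnit* src, std::size_t count) {
    std::vector<CodeUnit> dest(count);
    std::copy(src, src + count, dest.begin());  // zero code units are copied too
    return dest;
}

// Illustrative self-check; call it from main() to run it.
void copy_demo() {
    const char16_t src[] = u"Hello";  // five code units plus a null code unit
    const std::vector<char16_t> dest = copy_code_units(src, 5);
    assert(dest.size() == 5);
    assert(dest.front() == u'H' && dest.back() == u'o');
}
```

Because the range is delimited by a count rather than a terminator, the destination is always sized correctly before the copy begins.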

strlen takes a pointer to a code unit and counts the number of code units before it finds a null value. If you need this function as opposed to one that just tells you the number of code units in the string, you can implement it with the algorithm find by passing the null value as the value to be found. If instead you want the actual length of the sequence, your class should just offer a size method that returns whatever your class stores internally as the size.
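
Here is one way the find-based version might look. Unlike strlen, std::find needs an end iterator, so this sketch assumes the input is a fixed-size array and uses the array bound as the limit. The name length_until_null is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical strlen replacement for arrays of any code unit type: counts
// the code units before the first zero-valued one, or returns N if the
// buffer contains no zero code unit.
template <typename CodeUnit, std::size_t N>
std::size_t length_until_null(const CodeUnit (&buffer)[N]) {
    const CodeUnit* end = buffer + N;
    return static_cast<std::size_t>(std::find(buffer, end, CodeUnit()) - buffer);
}

// Illustrative self-check; call it from main() to run it.
void length_demo() {
    const char16_t hello[] = u"Hello";  // literal adds a null code unit
    assert(length_until_null(hello) == 5);
}
```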

Unlike the str* functions, the algorithms I've suggested take two iterators to demarcate a code unit sequence: one pointing to the first element in the sequence, and one pointing to the position after the final element. The str* functions only take a pointer to the first element and then assume the sequence continues until the first zero-valued code unit they find. When you're implementing your own templated string class, it's best to move away from the explicit null termination convention as well, and just offer an end() method that provides the correct end point for your string.
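
Putting those pieces together, a templated string along these lines might look like the sketch below. This is only an illustration under several assumptions: storage is delegated to std::vector, the class exposes raw pointers as iterators, and only equality and less-than are provided. The UTF8 and UTF16 aliases follow the question; everything else (member names, string_demo) is invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal sketch of a templated string that stores its length explicitly
// instead of relying on a null terminator.
template <typename CodeUnit>
class String {
public:
    String() = default;
    String(const CodeUnit* first, const CodeUnit* last) : units_(first, last) {}

    const CodeUnit* begin() const { return units_.data(); }
    const CodeUnit* end() const { return units_.data() + units_.size(); }
    std::size_t size() const { return units_.size(); }

    friend bool operator==(const String& a, const String& b) {
        return a.size() == b.size() && std::equal(a.begin(), a.end(), b.begin());
    }
    friend bool operator<(const String& a, const String& b) {
        return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end());
    }

private:
    std::vector<CodeUnit> units_;  // code units only, no terminator stored
};

using UTF8 = unsigned char;    // as in the question
using UTF16 = unsigned short;  // as in the question

// Illustrative self-check; call it from main() to run it.
void string_demo() {
    const UTF16 raw[] = {0x48, 0x00, 0x69};  // 'H', a zero code unit, 'i'
    String<UTF16> s(raw, raw + 3);
    assert(s.size() == 3);             // the embedded zero does not end the string
    assert(s.end() - s.begin() == 3);
    assert(s == s);
    assert(String<UTF16>() < s);       // the empty string sorts first
}
```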
