Which string classes to use in C++?
We have a multi-threaded desktop application in C++ (MFC). Currently developers use either CString or std::string, probably depending on their mood. So we'd like to choose a single implementation (probably something other than those two).
MFC's CString is based on the copy-on-write (COW) idiom, and some people would claim this is unacceptable in a multithreaded environment (and probably refer to this article). I am not convinced by such claims, as atomic counters seem to be quite fast, and this overhead is at least partly offset by a reduction in memory reallocations.
I learned that the std::string implementation depends on the compiler: it is not COW in MSVC, but it is, or was, in gcc. As far as I understand, the new C++0x standard is going to fix this by requiring a non-COW implementation, and will resolve some other issues, such as the contiguous-buffer requirement. So std::string does not look well defined at this point...
A quick example of what I don't like about std::string: there is no way to return a string from a function without excessive reallocations (the copy constructor runs if you return by value, and there is no access to the internal buffer to optimize that, so "return by reference", e.g. std::string& Result, doesn't help). I can do this with CString by either returning by value (no copy due to COW) or passing by reference and accessing the buffer directly. Again, C++0x comes to the rescue with its rvalue references, but we are not going to have C++0x in the near future.
Which string class should we use? Can COW really become an issue? Are there other commonly used efficient implementations of strings? Thanks.
EDIT: We don't use Unicode at the moment, and it is unlikely that we will need it. However, if there is something that easily supports Unicode (not at the cost of ICU...), that would be a plus.
I would use std::string.
The "return by value" issue is mostly a non-issue. Compilers are very good at performing Return Value Optimization (RVO), which eliminates the copy in most cases when returning by value. If it doesn't, you can usually tweak the function.
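The return-by-value point can be sketched as follows. This is a minimal illustration, not a benchmark: make_greeting and greeting_demo are hypothetical names, and whether the copy is actually elided depends on the compiler and its settings.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds a string locally and returns it by value.
// Compilers commonly construct `result` directly in the caller's storage
// (named return value optimization), so no copy of the buffer is made.
std::string make_greeting(const std::string& name) {
    std::string result = "Hello, ";
    result += name;
    return result;
}

// The out-parameter style mentioned in the question, for comparison.
void make_greeting(const std::string& name, std::string& out) {
    out = "Hello, ";
    out += name;
}

// Both styles produce the same value; only the calling convention differs.
void greeting_demo() {
    std::string by_value = make_greeting("MFC");
    std::string by_reference;
    make_greeting("MFC", by_reference);
    assert(by_value == by_reference);
}
```

The by-value form keeps the call site simple, which is why it is usually tried first and only rewritten if profiling shows a real cost.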
COW has been rejected for a reason: it doesn't scale (well), and the hoped-for increase in speed has not really been measured (see Herb Sutter's article). Atomic operations are not as cheap as they appear. On a single-processor, single-core machine they were cheap, but multi-core machines are now a commodity and multi-processor machines are widely available (for servers). In such distributed architectures there are multiple caches that need to be synchronized, and the more distributed the architecture, the more costly the atomic operations.
Does CString implement Small String Optimization? It's a simple trick that allows a string not to allocate any memory for small strings (usually a few characters). It is very useful because it turns out that most strings are in fact small: how many strings in your application are less than 8 characters long?
So, unless you present me a real benchmark which clearly shows a net gain in using CString, I'd prefer sticking with the standard: it's standard, and likely better optimized.
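One rough way to see whether a given std::string implementation applies the small string optimization is to check if the character data lives inside the string object itself. The sketch below is only a probe under that assumption; stored_inline and sso_demo are hypothetical names, and the result for short strings varies between standard libraries.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Returns true if the string's characters are stored inside the
// std::string object itself rather than in a separate heap block.
bool stored_inline(const std::string& s) {
    const char* object_begin = reinterpret_cast<const char*>(&s);
    const char* object_end = object_begin + sizeof(s);
    // std::less gives a well-defined order even for unrelated pointers.
    std::less<const char*> before;
    return !before(s.data(), object_begin) && before(s.data(), object_end);
}

void sso_demo() {
    std::string big(1000, 'x');
    // 1000 characters cannot fit inside the string object, so the data
    // must live elsewhere regardless of the implementation.
    assert(!stored_inline(big));
}
```

Calling stored_inline on a short string such as std::string("abc") then shows what the library in use does; no particular answer is guaranteed by the standard.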
Actually, the answer may be "it depends". But if you are using MFC, IMHO, CString would be the better choice. You can also use CString with STL containers. That leads to another question, though: should you use STL containers or MFC containers with CString? Using CString will give your application some agility, for example in Unicode conversions.
EDIT: Moreover, if you use Win32 API calls, CString conversions will be easier.
EDIT: CString has GetBuffer() and related methods that allow you to modify the buffer directly.
EDIT: I have used CString in our SQLite wrapper, and formatting CString is easier.
I don't know of any other common string implementations; they all suffer from the same language limitations in C++03. Either they offer something specific, like how the ICU components are great for Unicode, they're really old like CString is, or std::string trumps them.
However, you can use the same technique that the MSVC9 SP1 STL uses, that is, "swaptimization", which is the most hilariously named optimization ever.
If you rolled a custom string class that didn't allocate anything in its default constructor (or checked that your STL implementation doesn't), then swaptimizing it would guarantee no redundant allocations. For example, my MSVC STL uses SSO and doesn't allocate any heap memory by default, so with swaptimization I get no redundant allocations.
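A minimal sketch of what swaptimization looks like in user code, assuming it means swapping a returned temporary into an existing, empty string instead of copy-assigning it. The names build_path and swap_demo are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical producer; stands in for any function returning a string.
std::string build_path() {
    std::string s = "C:\\temp\\";
    s += "report.txt";
    return s;
}

void swap_demo() {
    std::string target;         // default-constructed: typically no heap use
    build_path().swap(target);  // take over the temporary's buffer, no copy
    assert(target == "C:\\temp\\report.txt");
}
```

This works in C++03 because swap is a member function that may be called on a temporary. With C++0x move semantics, a plain assignment from the function result achieves the same effect.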
You could also improve performance substantially by simply avoiding expensive heap allocation. There are allocators designed for temporary allocations, and you can replace the allocator used in your favourite STL implementation with a custom one. You can get things like object pools from Boost, or roll your own memory arena. You can get as much as tenfold better performance compared to a normal new allocation.
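As a rough sketch of the "roll a memory arena" idea, the code below plugs a fixed-size bump allocator into std::basic_string. It assumes a C++11 standard library that accepts minimal allocators; Arena, ArenaAllocator, ArenaString and arena_demo are hypothetical names, the 4096-byte capacity is arbitrary, and no performance figure is implied.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>

// Hypothetical fixed-size arena: hands out memory by bumping an offset and
// releases everything at once when the arena is destroyed.
class Arena {
public:
    Arena() : used_(0) {}
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    void* allocate(std::size_t bytes, std::size_t align) {
        std::size_t start = (used_ + align - 1) / align * align;
        if (start + bytes > sizeof(buffer_)) throw std::bad_alloc();
        used_ = start + bytes;
        return buffer_ + start;
    }
    std::size_t used() const { return used_; }

private:
    alignas(std::max_align_t) unsigned char buffer_[4096];
    std::size_t used_;
};

// Minimal C++11 allocator that forwards to an Arena.
template <class T>
struct ArenaAllocator {
    typedef T value_type;
    Arena* arena;

    explicit ArenaAllocator(Arena& a) : arena(&a) {}
    template <class U>
    ArenaAllocator(const ArenaAllocator<U>& other) : arena(other.arena) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(arena->allocate(n * sizeof(T), alignof(T)));
    }
    void deallocate(T*, std::size_t) {}  // memory is reclaimed with the arena
};

template <class T, class U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) {
    return a.arena == b.arena;
}
template <class T, class U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) {
    return !(a == b);
}

typedef std::basic_string<char, std::char_traits<char>, ArenaAllocator<char> >
    ArenaString;

void arena_demo() {
    Arena arena;
    ArenaAllocator<char> alloc(arena);
    ArenaString s(alloc);
    s.assign(200, 'x');  // long enough to exceed any small-string buffer
    assert(s.size() == 200);
    assert(arena.used() >= 200);  // the characters came from the arena
}
```

The trade-off is that individual strings never return memory to the arena, so this style fits short-lived, batch-like work rather than long-lived strings.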
I would suggest making a "per DLL" decision. If you have DLLs that depend heavily on MFC (for example, your GUI layer), where you need a lot of MFC calls with CString parameters, use CString. If you have DLLs where the only thing from MFC you would use is the CString class, use std::string instead. Of course, you will need conversion functions between the two classes, but I suspect you have already solved that issue.
I say always go for std::string. As mentioned, RVO and NRVO will make returning by value cheap, and when you eventually switch to C++0x, you get a nice performance boost from move semantics without doing anything. If you want to take any code and use it in a non-ATL/MFC project, you can't use CString, but std::string will be there, so you'll have a much easier time. Finally, you mentioned in a comment that you use STL containers instead of MFC containers (good move). Why not stay consistent and use the STL string too?
I would advise using std::basic_string as your general string template base unless there is a good reason to do otherwise. I say basic_string because if you are handling 16-bit characters you would use wstring.
If you are going to use TCHAR, you should probably define tstring as std::basic_string<TCHAR>, and you may also wish to implement a traits class for it that uses functions like _tcslen.
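The tstring idea can be sketched as below. The non-Windows fallback typedef is an assumption added only so the sketch builds outside Windows, and tchar_length is a hypothetical helper. Note that the default std::char_traits already provides a length function for both char and wchar_t, so a custom traits class is often unnecessary.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <tchar.h>   // defines TCHAR as char or wchar_t
#else
typedef char TCHAR;  // assumption: stand-in so the sketch builds elsewhere
#endif

// One string type that follows the project's character setting.
typedef std::basic_string<TCHAR> tstring;

// Hypothetical helper: length of a TCHAR C string via the default traits.
std::size_t tchar_length(const TCHAR* text) {
    return std::char_traits<TCHAR>::length(text);
}

void tstring_demo() {
    const TCHAR abc[] = { TCHAR('a'), TCHAR('b'), TCHAR('c'), TCHAR(0) };
    tstring s(abc);
    assert(s.size() == tchar_length(abc));
}
```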
std::string is reference counted in some implementations (gcc's, for example, though not MSVC's), so pass-by-value is still a cheap operation there (and even more so with the rvalue reference stuff in C++0x). The COW is triggered only for strings that have multiple references pointing to them, i.e. a write to a string that shares its buffer with another will go through the COW path. As the COW happens inside operator[], you can force a string to use a private buffer by using its (non-const) operator[]() or begin() methods.
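The COW path described above can be illustrated with a toy class. This is a hypothetical sketch assuming C++11 (for std::shared_ptr); it is not how any real std::string or CString is implemented, and it is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical toy copy-on-write string, for illustration only.
class CowString {
public:
    explicit CowString(const char* text)
        : data_(std::make_shared<std::string>(text)) {}

    // Read access never copies.
    char operator[](std::size_t i) const { return (*data_)[i]; }

    // Write access detaches first if the buffer is shared.
    char& operator[](std::size_t i) {
        detach();
        return (*data_)[i];
    }

    // True while both objects still point at the same buffer.
    bool shares_buffer_with(const CowString& other) const {
        return data_ == other.data_;
    }

private:
    void detach() {
        if (data_.use_count() > 1) {
            data_ = std::make_shared<std::string>(*data_);
        }
    }
    std::shared_ptr<std::string> data_;
};

void cow_demo() {
    CowString a("foo");
    CowString b = a;                   // cheap: only the count changes
    assert(a.shares_buffer_with(b));
    b[0] = 'b';                        // non-const operator[] makes the copy
    assert(!a.shares_buffer_with(b));
    const CowString& ca = a;
    assert(ca[0] == 'f');              // the original is unchanged
}
```

In a multithreaded program the reference count would need atomic updates, which is the cost the other answers are debating.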