Initializing a std::string from a char* without copying

Posted 2024-07-09 12:33:34

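I have a situation where I need to process large amounts (many GB) of data as follows:

1. Build a large string by appending many smaller (C char*) strings
2. Trim the string
3. Convert the string into a C++ const std::string for processing (read-only)
4. Repeat

The data in each iteration is independent.

My problem is that I want to minimise (and, if possible, eliminate) heap allocation, because it is currently my biggest performance problem.

Is there a way to convert a C string (char*) to an STL C++ string (std::string) without std::string allocating and copying the data internally?

Alternatively, could I use a stringstream or something similar to reuse a large buffer?

EDIT: Thanks for the answers. For clarity, I think the revised question is:

How do I efficiently build (via multiple appends) an STL C++ string? And if I do this in a loop, where each iteration is completely independent, how can I reuse the allocated space?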

Anonymous, 2024-07-16 12:33:34

You can't actually form a std::string without copying the data. A stringstream would probably reuse the memory from pass to pass (though I think the standard is silent on whether it actually has to), but it still wouldn't avoid the copying.
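For the narrower goal of reusing one allocation across iterations, here is a minimal sketch. It is not part of the answer above, and `build_iteration` is a hypothetical helper. It relies on mainstream standard library implementations keeping a std::string's capacity after clear(), which the standard does not strictly promise. The bytes are still copied, as the answer says; only the repeated heap allocation goes away.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: rebuilds `buffer` from the fragments of one iteration.
// clear() resets the size to zero; common implementations keep the capacity,
// so later iterations can append without allocating again.
void build_iteration(std::string& buffer,
                     const std::vector<const char*>& fragments) {
    buffer.clear();
    for (const char* fragment : fragments) {
        buffer.append(fragment);  // copies the bytes into the reused storage
    }
}
```

Calling reserve() once before the loop with an estimate of the largest iteration should keep reallocations out of the loop entirely, assuming the estimate holds.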

A common approach to this sort of problem is to write the code that processes the data in step 3 so that it takes a begin/end iterator pair; it can then just as easily process a std::string, a vector of chars, a pair of raw pointers, and so on. Unlike code that is passed a container type such as std::string, it no longer knows or cares how the memory was allocated, since the memory still belongs to the caller. Boost.Range carries this idea to its logical conclusion: it supplies the overloads that let the caller pass either a string/vector/list/any other container with .begin() and .end(), or a separate pair of iterators.
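As a rough illustration of the iterator-pair idea, the sketch below uses a hypothetical processing step, `count_non_space`, standing in for whatever step 3 really does. The name and the processing itself are assumptions; the point is only that the function never sees the container.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical processing step: counts the non-space characters in the
// half-open range [first, last). It works with any iterator type that
// supports !=, ++ and *, so it does not care who owns the memory.
template <typename Iterator>
std::size_t count_non_space(Iterator first, Iterator last) {
    std::size_t count = 0;
    for (; first != last; ++first) {
        if (*first != ' ') {
            ++count;
        }
    }
    return count;
}
```

The same function accepts `s.begin(), s.end()` for a std::string, `v.begin(), v.end()` for a std::vector<char>, or `buf, buf + len` for a raw char buffer, and in none of these cases is the data copied.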

Having written your processing code to work on an arbitrary iterator range, you could then even write a custom iterator (not as hard as it sounds: basically just an object with some standard typedefs, with operators ++, *, =, == and != overloaded to get a forward-only iterator) that takes care of advancing to the next fragment each time it hits the end of the one it is working on, skipping over whitespace (I assume that's what you meant by trim). That way you never have to assemble the whole string contiguously at all. Whether or not this would be a win depends on how many fragments you have and how large they are. This is essentially what the SGI rope mentioned by Martin York is: a string where append forms a linked list of fragments instead of a contiguous buffer, which makes it suitable for much longer values.
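A minimal sketch of such a fragment-walking iterator might look like the following. The class name `FragmentIterator` and the choice to hold the fragments in a std::vector<const char*> are assumptions made for illustration, and whitespace skipping is left out to keep it short.

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical forward-only iterator that visits the characters of several
// NUL-terminated fragments in order, without joining them into one buffer.
class FragmentIterator {
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = char;
    using difference_type = std::ptrdiff_t;
    using pointer = const char*;
    using reference = const char&;

    FragmentIterator() = default;

    // index == 0 gives the begin iterator; index == fragments.size() gives end.
    FragmentIterator(const std::vector<const char*>& fragments, std::size_t index)
        : fragments_(&fragments), index_(index) {
        settle();
    }

    reference operator*() const { return *pos_; }

    FragmentIterator& operator++() {
        ++pos_;
        if (*pos_ == '\0') {  // end of the current fragment: move to the next
            ++index_;
            settle();
        }
        return *this;
    }

    FragmentIterator operator++(int) {
        FragmentIterator old = *this;
        ++*this;
        return old;
    }

    bool operator==(const FragmentIterator& other) const {
        return index_ == other.index_ && pos_ == other.pos_;
    }
    bool operator!=(const FragmentIterator& other) const { return !(*this == other); }

private:
    // Skips empty fragments, then points pos_ at the first character of the
    // fragment at index_, or at nullptr once every fragment is exhausted.
    void settle() {
        while (index_ < fragments_->size() && *(*fragments_)[index_] == '\0') {
            ++index_;
        }
        pos_ = index_ < fragments_->size() ? (*fragments_)[index_] : nullptr;
    }

    const std::vector<const char*>* fragments_ = nullptr;
    std::size_t index_ = 0;
    const char* pos_ = nullptr;
};
```

A pair built as `FragmentIterator(fragments, 0)` and `FragmentIterator(fragments, fragments.size())` can be handed to any processing function written against an iterator range.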

UPDATE (since I still see occasional upvotes on this answer):

C++17 introduces another choice: std::string_view, which replaces std::string in many function signatures and is a non-owning reference to character data. A std::string converts to it implicitly, but it can also be explicitly constructed from contiguous data owned somewhere else, avoiding the unnecessary copying that std::string imposes.
