Initializing a std::string from a char* without copying

Posted 2024-07-09 12:33:34

I have a situation where I need to process large amounts of data (many GB) as follows:

  1. build a large string by appending many smaller (C char*) strings
  2. trim the string
  3. convert the string into a C++ const std::string for processing (read only)
  4. repeat

The data in each iteration are independent.

My question: I'd like to minimise (and if possible eliminate) heap allocation, as it is currently my largest performance problem.

Is there a way to convert a C string (char*) into an STL C++ string (std::string) without requiring std::string to internally allocate/copy the data?

Alternatively, could I use stringstreams or something similar to re-use a large buffer?

Edit: Thanks for the answers, for clarity, I think a revised question would be:

How can I efficiently build an STL C++ string via multiple appends? And if this is done in a loop where each iteration is totally independent, how can I re-use the allocated space?

Comments (6)

Anonymous 2024-07-16 12:33:34

You can't actually form a std::string without copying the data. A stringstream would probably reuse the memory from pass to pass (though I think the standard is silent on whether it actually has to), but it still wouldn't avoid the copying.

A common approach to this sort of problem is to write the code which processes the data in step 3 to use a begin/end iterator pair; then it can easily process either a std::string, a vector of chars, a pair of raw pointers, etc. Unlike passing it a container type like std::string, it would no longer know or care how the memory was allocated, since it would still belong to the caller. Carrying this idea to its logical conclusion is boost::range, which adds all the overloaded constructors to still let the caller just pass a string/vector/list/any sort of container with .begin() and .end(), or separate iterators.
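As a minimal sketch of the iterator-pair idea, assume the processing in step 3 is something like counting words. The function name `count_words` and the word-counting task are hypothetical, chosen only for illustration; the point is that the same template accepts raw pointers, std::string iterators, or vector iterators without caring who owns the memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical processing step: counts whitespace-separated words in
// [first, last). It only reads through the iterators, so the caller keeps
// ownership of the underlying memory and nothing is copied.
template <typename Iter>
std::size_t count_words(Iter first, Iter last) {
    std::size_t words = 0;
    bool in_word = false;
    for (; first != last; ++first) {
        const bool space = (*first == ' ' || *first == '\t' || *first == '\n');
        if (!space && !in_word) ++words;  // a new word starts here
        in_word = !space;
    }
    return words;
}

// Convenience overload for a NUL-terminated C string; still no copy.
inline std::size_t count_words(const char* s) {
    assert(s != nullptr);
    return count_words(s, s + std::strlen(s));
}
```

The same call works with `s.begin(), s.end()` on a std::string, with a `std::vector<char>`, or with a pair of `const char*` pointers into a buffer owned elsewhere.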

Having written your processing code to work on an arbitrary iterator range, you could then even write a custom iterator (not as hard as it sounds: basically just an object with some standard typedefs, and operator ++/*/=/==/!= overloaded to get a forward-only iterator) that takes care of advancing to the next fragment each time it hits the end of the one it's working on, skipping over whitespace (I assume that's what you meant by trim). That way you would never have to assemble the whole string contiguously at all. Whether or not this would be a win depends on how many fragments you have and how large they are. This is essentially what the SGI rope mentioned by Martin York is: a string where append forms a linked list of fragments instead of a contiguous buffer, which is thus suitable for much longer values.
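A rough sketch of such a fragment-walking iterator might look like the following. The class name `fragment_iterator` and the choice of a `std::vector<const char*>` to hold the fragments are assumptions made for the example, and whitespace skipping is deliberately left out to keep it short; it only shows advancing from one fragment to the next.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical forward-only iterator that walks the characters of several
// NUL-terminated fragments in order, as if they had been concatenated.
class fragment_iterator {
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = char;
    using difference_type = std::ptrdiff_t;
    using pointer = const char*;
    using reference = const char&;

    fragment_iterator() = default;  // default-constructed == end of sequence

    explicit fragment_iterator(const std::vector<const char*>* frags)
        : frags_(frags) { settle(); }

    reference operator*() const {
        assert(frags_ != nullptr);  // must not dereference the end iterator
        return (*frags_)[index_][offset_];
    }
    fragment_iterator& operator++() { ++offset_; settle(); return *this; }
    fragment_iterator operator++(int) { auto old = *this; ++*this; return old; }

    friend bool operator==(const fragment_iterator& a, const fragment_iterator& b) {
        return a.frags_ == b.frags_ && a.index_ == b.index_ && a.offset_ == b.offset_;
    }
    friend bool operator!=(const fragment_iterator& a, const fragment_iterator& b) {
        return !(a == b);
    }

private:
    // Move past exhausted (or empty) fragments; become the end iterator
    // once every fragment has been consumed.
    void settle() {
        while (frags_ && index_ < frags_->size() &&
               (*frags_)[index_][offset_] == '\0') {
            ++index_;
            offset_ = 0;
        }
        if (frags_ && index_ == frags_->size()) {
            frags_ = nullptr;
            index_ = 0;
            offset_ = 0;
        }
    }

    const std::vector<const char*>* frags_ = nullptr;
    std::size_t index_ = 0;   // which fragment
    std::size_t offset_ = 0;  // position inside that fragment
};
```

Any algorithm written against an iterator pair can then consume `fragment_iterator(&fragments)` up to a default-constructed `fragment_iterator()` without the fragments ever being joined into one buffer.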


UPDATE (since I still see occasional upvotes on this answer):

C++17 introduces another choice: std::string_view, which can replace std::string in many function signatures, is a non-owning reference to character data. It is implicitly convertible from std::string, but can also be explicitly constructed from contiguous data owned somewhere else, avoiding the unnecessary copying std::string imposes.
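To illustrate the std::string_view route, here is a small sketch that requires C++17. The helper names `view_of`, `trim_view`, and `count_char` are hypothetical; they stand in for the trimming and read-only processing steps from the question, and none of them allocates or copies character data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: wraps a buffer owned elsewhere. The view stays valid
// only as long as the underlying buffer does.
inline std::string_view view_of(const char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    return std::string_view(data, size);
}

// Hypothetical helper: trims leading and trailing whitespace by narrowing
// the view; the characters themselves are never moved or copied.
inline std::string_view trim_view(std::string_view s) {
    const char* ws = " \t\r\n";
    const auto first = s.find_first_not_of(ws);
    if (first == std::string_view::npos) return {};
    const auto last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical read-only processing step taking a view instead of a
// const std::string&, so it accepts both std::string and raw buffers.
inline std::size_t count_char(std::string_view s, char c) {
    std::size_t n = 0;
    for (char ch : s) {
        if (ch == c) ++n;
    }
    return n;
}
```

A caller could build its data in one reusable char buffer, pass `trim_view(view_of(buffer, length))` to the processing code, and then overwrite the buffer on the next iteration.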

じ违心 2024-07-16 12:33:34

Is it at all possible to use a C++ string in step 1? If you use string::reserve(size_t), you can allocate a large enough buffer to prevent multiple heap allocations while appending the smaller strings, and then you can just use that same C++ string throughout all of the remaining steps.

See this link for more information on the reserve function.
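A minimal sketch of the reserve-and-reuse approach follows. The helper name `build_into` and the use of a `std::vector<const char*>` for the input pieces are assumptions for illustration. It relies on `clear()` leaving the capacity in place, which mainstream standard library implementations do in practice, so appends that fit within the reserved size should not trigger a new allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: rebuilds `out` from `pieces`, reusing whatever buffer
// `out` already owns. Call out.reserve(n) once before the loop with an upper
// bound on the expected size.
inline void build_into(std::string& out, const std::vector<const char*>& pieces) {
    out.clear();  // size becomes 0; the allocated buffer is kept in practice
    for (const char* piece : pieces) {
        assert(piece != nullptr);
        out.append(piece);  // copies characters, but no reallocation while it fits
    }
}
```

The characters are still copied once into the string, but the heap allocation happens only when the string first grows to its working size, not on every iteration.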

樱&纷飞 2024-07-16 12:33:34

To help with really big strings, SGI has the class rope in its STL.
Nonstandard, but may be useful.

http://www.sgi.com/tech/stl/Rope.html

Apparently rope is in the next version of the standard :-)
Note the developer joke. A rope is a big string. (Ha Ha) :-)

蓝眼泪 2024-07-16 12:33:34

This is a lateral thinking answer, not directly addressing the question but "thinking" around it. Might be useful, might not...

Read-only processing of std::string doesn't really require a very complex subset of std::string's features. Is there a possibility that you could do search/replace on the code that performs all the processing on std::strings so it takes some other type instead? Start with a blank class:

class lightweight_string { };

Then replace all std::string references with lightweight_string. Perform a compilation to find out exactly what operations are needed on lightweight_string for it to act as a drop-in replacement. Then you can make your implementation work however you want.
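As one possible outcome of that exercise, the class might end up as a non-owning pointer-plus-length wrapper. The member set below is a guess at what read-only processing code would typically ask for, not something the compiler errors in your code base are guaranteed to produce.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Sketch of a non-owning, read-only string: it refers to characters that
// live in a buffer owned by the caller and never allocates.
class lightweight_string {
public:
    lightweight_string(const char* data, std::size_t size)
        : data_(data), size_(size) {}

    // Construct from a NUL-terminated C string (measures it, does not copy).
    explicit lightweight_string(const char* c_str)
        : data_(c_str), size_(std::strlen(c_str)) {}

    std::size_t size() const { return size_; }
    bool empty() const { return size_ == 0; }
    const char* data() const { return data_; }  // not necessarily NUL-terminated
    const char* begin() const { return data_; }
    const char* end() const { return data_ + size_; }

    char operator[](std::size_t i) const {
        assert(i < size_);
        return data_[i];
    }

private:
    const char* data_;
    std::size_t size_;
};
```

Because it owns nothing, the buffer it points at has to outlive every lightweight_string that refers to it.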

渡你暖光 2024-07-16 12:33:34

Is each iteration independent enough that you can use the same std::string for each iteration? One would hope that your std::string implementation is smart enough to re-use memory if you assign a const char * to it when it was previously used for something else.

Assigning a char * to a std::string must always at least copy the data. Memory management is one of the main reasons to use std::string, so you won't be able to override it.

绝情姑娘 2024-07-16 12:33:34

In this case, might it be better to process the char* directly, instead of assigning it to a std::string?
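For instance, the trim step can be done directly on a pointer range without creating any string object. The names `char_range`, `is_blank`, and `trim_range` below are hypothetical, and the definition of whitespace is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical half-open range [first, last) into a buffer owned by the caller.
struct char_range {
    const char* first;
    const char* last;
};

inline bool is_blank(char c) {
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

// Trims by moving the two pointers inward; the buffer is left untouched.
inline char_range trim_range(const char* first, const char* last) {
    assert(first != nullptr && last != nullptr);
    while (first != last && is_blank(*first)) ++first;
    while (last != first && is_blank(*(last - 1))) --last;
    return char_range{first, last};
}
```

The resulting pointer pair can be handed straight to processing code that accepts a begin/end range.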
