Is this optimization allowed in an implementation of std::string?

Posted on 2024-10-12 01:00:28

I was just thinking about the implementation of std::string::substr. It returns a new std::string object, which seems a bit wasteful to me. Why not return an object that refers to the contents of the original string and can be implicitly assigned to a std::string? A kind of lazy evaluation of the actual copying. Such a class could look something like this:

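#include <string>

template <class Ch, class Tr, class A>
class string_ref {
public:
    typedef typename std::basic_string<Ch, Tr, A>::size_type size_type;

    // not important yet, but *looks* like basic_string's for the most part

private:
    const std::basic_string<Ch, Tr, A> &s_;  // the string being referred to
    const size_type pos_;                    // offset of the first character
    const size_type len_;                    // number of characters referred to
};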

The public interface of this class would mimic all of the read-only operations of a real std::string, so the usage would be seamless. std::string could then have a new constructor which takes a string_ref so the user would never be the wiser. The moment you try to "store" the result, you end up creating a copy, so no real issues with the reference pointing to data and then having it modified behind its back.

The idea being that code like this:

std::string s1 = "hello world";
std::string s2 = "world";
if(s1.substr(6) == s2) {
    std::cout << "match!" << std::endl;
}

would have no more than 2 std::string objects constructed in total. This seems like a useful optimization for code that performs a lot of string manipulation. Of course, this doesn't just apply to std::string, but to any type which can return a subset of its contents.

As far as I know, no implementations do this.

I suppose the core of the question is:

Given a class that can be implicitly converted to a std::string as needed, would it be conforming to the standard for a library writer to change the return type in the prototype of a member function? Or more generally, do the library writers have the leeway to return "proxy objects" instead of regular objects in these types of cases as an optimization?

My gut is that this is not allowed and that the prototypes must match exactly. Given that you cannot overload on return type alone, that would leave no room for library writers to take advantage of these types of situations. Like I said, I think the answer is no, but I figured I'd ask :-).

回忆那么伤 2024-10-19 01:00:28

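The idea is copy-on-write, but instead of COW'ing the entire buffer, you keep track of which subset of the buffer is the "real" string. (COW, in its normal form, was (is?) used in some library implementations.)

So you don't need a proxy object or a change of interface at all, because these details can be made completely internal. Conceptually, you need to keep track of four things: a source buffer, a reference count for the buffer, and the start and end of the string within this buffer.

Any time an operation modifies the buffer, it creates its own copy (from the start and end delimiters), decreases the old buffer's reference count by one, and sets the new buffer's reference count to one. The rest of the reference counting rules are the same: copy and increase the count by one, destruct a string and decrease the count by one, reach zero and delete, etc.

substr just makes a new string instance, except with the start and end delimiters explicitly specified.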
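The bookkeeping described above can be sketched as follows. This is only an illustration under stated assumptions: `cow_string`, `set`, `str` and `shares_buffer_with` are made-up names, the reference count is delegated to `std::shared_ptr`, and the `use_count()` check is not safe if several threads touch the same buffer.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical class for illustration; not part of any standard library.
class cow_string {
public:
    cow_string(const char* s)
        : buf_(std::make_shared<std::string>(s)), begin_(0), end_(buf_->size()) {}

    // substr shares the buffer and only narrows the [begin_, end_) window.
    cow_string substr(std::size_t pos, std::size_t len = std::string::npos) const {
        cow_string r(*this);                 // copies the shared_ptr, count + 1
        if (pos > size()) pos = size();      // clamp instead of throwing, for brevity
        r.begin_ = begin_ + pos;
        if (len < r.end_ - r.begin_) r.end_ = r.begin_ + len;
        return r;
    }

    // Any write first detaches: copy only our window into a private buffer.
    // No bounds checking on i, to keep the sketch short.
    void set(std::size_t i, char c) {
        if (buf_.use_count() > 1) {
            buf_ = std::make_shared<std::string>(*buf_, begin_, end_ - begin_);
            end_ -= begin_;
            begin_ = 0;
        }
        (*buf_)[begin_ + i] = c;
    }

    std::size_t size() const { return end_ - begin_; }
    std::string str() const { return std::string(*buf_, begin_, end_ - begin_); }
    bool shares_buffer_with(const cow_string& o) const { return buf_ == o.buf_; }

private:
    std::shared_ptr<std::string> buf_;  // source buffer plus its reference count
    std::size_t begin_, end_;           // window of the buffer that is "our" string
};
```

With this sketch, `a.substr(6)` allocates nothing; a copy of the characters is made only when one of the sharing strings is written to.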

三生殊途 2024-10-19 01:00:28

This is a well-known and fairly widely used optimization called copy-on-write, or COW. In its basic form it does not even involve substrings, but something as simple as

s1 = s2;

Now, the problem with this optimization is that for C++ libraries that are supposed to be used on targets supporting multiple threads, the reference count for the string has to be accessed using atomic operations (or worse, protected with a mutex in case the target platform doesn't supply atomic operations). This is expensive enough that in most cases the simple non-COW string implementation is faster.
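To make the cost concrete, here is a minimal sketch of what `s1 = s2` turns into when the buffer is shared. The names `shared_buf` and `rc_string` are hypothetical, the write path (the actual copy on write) is left out, and real library implementations lay out their data differently.

```cpp
#include <atomic>
#include <cstddef>
#include <cstring>

// Hypothetical shared buffer for illustration; real libraries differ in layout.
struct shared_buf {
    std::atomic<std::size_t> refs;
    std::size_t len;
    char* data;
};

class rc_string {
public:
    explicit rc_string(const char* s) : b_(new shared_buf) {
        b_->refs.store(1);
        b_->len = std::strlen(s);
        b_->data = new char[b_->len + 1];
        std::memcpy(b_->data, s, b_->len + 1);
    }
    rc_string(const rc_string& o) : b_(o.b_) { acquire(); }
    rc_string& operator=(const rc_string& o) {
        if (b_ != o.b_) { release(); b_ = o.b_; acquire(); }
        return *this;
    }
    ~rc_string() { release(); }

    const char* c_str() const { return b_->data; }
    std::size_t use_count() const { return b_->refs.load(); }

private:
    // Each copy and each destruction pays for an atomic read-modify-write.
    void acquire() { b_->refs.fetch_add(1, std::memory_order_relaxed); }
    void release() {
        if (b_->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete[] b_->data;
            delete b_;
        }
    }
    shared_buf* b_;
};
```

No characters are copied on assignment, but every copy, assignment and destructor call goes through `fetch_add` or `fetch_sub`, which is the overhead this answer is pointing at.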

See GOTW #43-45:

http://www.gotw.ca/gotw/043.htm

http://www.gotw.ca/gotw/044.htm

http://www.gotw.ca/gotw/045.htm

To make matters worse, libraries that have used COW, such as the GNU C++ library, cannot simply revert to the simple implementation since that would break the ABI. (Although, C++0x to the rescue, as that will require an ABI bump anyway! :) )

紧拥背影 2024-10-19 01:00:28

Since substr returns std::string, there is no way to return a proxy object, and they can't just change the return type or overload on it (for the reasons you mentioned).

They could do this by making string itself capable of being a substring of another string. This would mean a memory penalty for all usages (to hold an extra string and two size_types). Also, every operation would need to check whether it holds the characters itself or is a proxy. Perhaps this could be done with an implementation pointer, but the problem is that we would then be making a general-purpose class slower for a possible edge case.

If you need this, the best way is to create another class, substring, that constructs from a string, pos, and length, and converts to string. You can't use it as s1.substr(6), but you can do

 substring sub(s1, 6);

You would also need to create common operations that take a substring and string to avoid the conversion (since that's the whole point).
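A minimal sketch of such a class might look like the following. Everything here is an assumption made for illustration: the clamping of pos and len, the `size()` member, and the choice of `operator==` as the one "common operation" shown.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper class; the name and interface are illustrative only.
class substring {
public:
    substring(const std::string& s, std::size_t pos,
              std::size_t len = std::string::npos)
        : s_(s), pos_(std::min(pos, s.size())),
          len_(std::min(len, s.size() - pos_)) {}

    // Conversion makes the copy that substr() would have made up front.
    operator std::string() const { return std::string(s_, pos_, len_); }

    std::size_t size() const { return len_; }

    // Compare in place, without materialising a temporary std::string.
    friend bool operator==(const substring& a, const std::string& b) {
        return a.s_.compare(a.pos_, a.len_, b) == 0;
    }
    friend bool operator==(const std::string& a, const substring& b) {
        return b.s_.compare(b.pos_, b.len_, a) == 0;
    }

private:
    const std::string& s_;  // must outlive this object
    std::size_t pos_;
    std::size_t len_;
};
```

Because the class only holds a reference, the source string has to outlive the substring object and must not be modified while it is in use; that is the price of skipping the copy.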

上课铃就是安魂曲 2024-10-19 01:00:28

Regarding your specific example, this worked for me:

if (&s1[6] == s2) {
    std::cout << "match!" << std::endl;
}

That may not answer your question for a general-purpose solution. For that, you'd need sub-string CoW, as @GMan suggests.
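For this specific comparison there is also a route that stays within std::string's existing interface: the `compare(pos, len, str)` overload compares a range of the string in place. The wrapper name `tail_equals` below is made up for this sketch; unlike the pointer comparison above, it does not depend on a terminating null character and does not stop at embedded nulls.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if s.substr(pos) would equal other,
// checked without constructing the temporary that substr() returns.
inline bool tail_equals(const std::string& s, std::size_t pos,
                        const std::string& other) {
    // compare() throws std::out_of_range when pos > s.size(), so guard first.
    return pos <= s.size() && s.compare(pos, std::string::npos, other) == 0;
}
```

In the example from the question, `tail_equals(s1, 6, s2)` plays the role of `s1.substr(6) == s2`.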

幸福丶如此 2024-10-19 01:00:28

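What you are talking about is (or was) one of the core features of Java's java.lang.String class (http://fishbowl.pastiche.org/2005/04/27/the_string_memory_gotcha/). In many ways, the designs of Java's String class and C++'s basic_string template are similar, so I would imagine that writing an implementation of the basic_string template that uses this "substring optimization" is possible.

One thing you will need to consider is how to write the implementation of the c_str() const member. Depending on where a string sits as a substring of another, it may have to create a new copy. It would definitely have to create a new copy of the internal array if the string for which c_str was requested is not a trailing substring. I think this necessitates using the mutable keyword on most, if not all, of the data members of the basic_string implementation, greatly complicating the implementation of the other const methods, because the compiler is no longer able to help the programmer with const correctness.

EDIT: Actually, to accommodate c_str() const and data() const, you could use a single mutable field of type const charT*. Initially set to NULL, it could be per-instance, initialized to a pointer to a new charT array whenever c_str() const or data() const is called, and deleted in the basic_string destructor if non-NULL.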
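The single mutable field from the edit can be sketched roughly like this. The class name `sub_view` is hypothetical and char stands in for charT. For brevity the sketch always copies on the first `c_str()` call (even for a trailing substring), does no bounds checking, and is not thread-safe.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical view-like class for illustration; not a real basic_string.
class sub_view {
public:
    sub_view(const std::string& s, std::size_t pos, std::size_t len)
        : data_(s.data() + pos), len_(len), cstr_(NULL) {}
    ~sub_view() { delete[] cstr_; }   // deleting NULL is a no-op

    const char* c_str() const {
        if (!cstr_) {                            // first call: build the copy
            char* p = new char[len_ + 1];
            std::memcpy(p, data_, len_);
            p[len_] = '\0';
            cstr_ = p;                           // allowed because cstr_ is mutable
        }
        return cstr_;
    }

private:
    sub_view(const sub_view&);             // copying disabled to keep the sketch short
    sub_view& operator=(const sub_view&);

    const char* data_;           // points into the source string, not owned
    std::size_t len_;
    mutable const char* cstr_;   // lazily created null-terminated copy
};
```

Only `cstr_` needs to be mutable here; the other members stay under the compiler's const checking, which is the simplification the edit is after.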