是否允许在 std::string 的实现中进行这种优化?

发布于 2024-10-12 01:00:28 字数 1176 浏览 6 评论 0原文

我只是在考虑 std::string::substr 的实现。它返回一个新的 std::string 对象,这对我来说似乎有点浪费。为什么不返回一个引用原始字符串内容并可以隐式分配给 std::string 的对象?一种对实际复制的惰性评估。这样的类可能看起来像这样:

template <class Ch, class Tr, class A>
class string_ref {
public:
    // not important yet, but *looks* like basic_string's for the most part

private:
    const basic_string<Ch, Tr, A> &s_;
    const size_type pos_;
    const size_type len_;    
};

此类的公共接口将模仿真实 std::string 的所有只读操作,因此使用将是无缝的。然后,std::string 可以有一个新的构造函数,它接受 string_ref,因此用户永远不会变得更明智。当您尝试“存储”结果时,您最终会创建一个副本,因此指向数据的引用然后在其背后进行修改并不存在真正的问题。

这个想法是这样的代码:

std::string s1 = "hello world";
std::string s2 = "world";
if(s1.substr(6) == s2) {
    std::cout << "match!" << std::endl;
}

总共构造的 std::string 对象不超过 2 个。对于执行大量字符串操作的代码来说,这似乎是一个有用的优化。当然,这不仅仅适用于 std::string,还适用于任何可以返回其内容子集的类型。

据我所知,没有任何实现可以做到这一点。

我想问题的核心是:

给定一个可以根据需要隐式转换为 std::string 的类,它是否符合库编写者更改 a 原型的标准?成员的返回类型?或者更一般地说,在这些类型的情况下,库编写者是否有余地返回“代理对象”而不是常规对象作为优化?

我的直觉是这是不允许的,原型必须完全匹配。鉴于您不能仅在返回类型上重载,因此库编写者将没有空间利用这些类型的情况。就像我说的,我认为答案是否定的,但我想我会问:-)。

I was just thinking about the implementation of std::string::substr. It returns a new std::string object, which seems a bit wasteful to me. Why not return an object that refers to the contents of the original string and can be implicitly assigned to a std::string? A kind of lazy evaluation of the actual copying. Such a class could look something like this:

template <class Ch, class Tr, class A>
class string_ref {
public:
    // not important yet, but *looks* like basic_string's for the most part

private:
    const basic_string<Ch, Tr, A> &s_;
    const size_type pos_;
    const size_type len_;    
};

The public interface of this class would mimic all of the read-only operations of a real std::string, so the usage would be seamless. std::string could then have a new constructor which takes a string_ref so the user would never be the wiser. The moment you try to "store" the result, you end up creating a copy, so no real issues with the reference pointing to data and then having it modified behind its back.

The idea being that code like this:

std::string s1 = "hello world";
std::string s2 = "world";
if(s1.substr(6) == s2) {
    std::cout << "match!" << std::endl;
}

would have no more than 2 std::string objects constructed in total. This seems like a useful optimization for code which that performs a lot of string manipulations. Of course, this doesn't just apply to std::string, but to any type which can return a subset of its contents.

As far as I know, no implementations do this.

I suppose the core of the question is:

Given a class that can be implicitly converted to a std::string as needed, would it be conforming to the standard for a library writer to change the prototype of a member's to return type? Or more generally, do the library writers have the leeway to return "proxy objects" instead of regular objects in these types of cases as an optimization?

My gut is that this is not allowed and that the prototypes must match exactly. Given that you cannot overload on return type alone, that would leave no room for library writers to take advantage of these types of situations. Like I said, I think the answer is no, but I figured I'd ask :-).

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

回忆那么伤 2024-10-19 01:00:28

这个想法是copy-on-write,但是您无需跟踪整个缓冲区,而是跟踪缓冲区的哪个子集是“真实”字符串。 (COW,以其正常形式,在某些库实现中使用(是?)。)

因此,您根本不需要代理对象或接口更改,因为这些细节可以完全内部化。从概念上讲,您需要跟踪四件事:源缓冲区、缓冲区的引用计数以及该缓冲区内字符串的开头和结尾。

每当一个操作修改缓冲区时,它都会创建自己的副本(从开始和结束分隔符),将旧缓冲区的引用计数减一,并将新缓冲区的引用计数设置为 1。其余的引用计数规则是相同的:复制并将计数加一,破坏字符串并将计数减一,达到零并删除等。

substr 只是创建一个新的字符串实例,除了明确指定开始和结束分隔符。

This idea is copy-on-write, but instead of COW'ing the entire buffer, you keep track of which subset of the buffer is the "real" string. (COW, in its normal form, was (is?) used in some library implementations.)

So you don't need a proxy object or change of interface at all because these details can be made completely internal. Conceptually, you need to keep track of four things: a source buffer, a reference count for the buffer, and the start and end of the string within this buffer.

Anytime an operation modifies the buffer at all, it creates its own copy (from the start and end delimiters), decreases the old buffer's reference count by one, and sets the new buffer's reference count to one. The rest of the reference counting rules are the same: copy and increase count by one, destruct a string and decrease count by one, reach zero and delete, etc.

substr just makes a new string instance, except with the start and end delimiters explicitly specified.

三生殊途 2024-10-19 01:00:28

这是一个相当有名、应用比较广泛的优化,称为写时复制(copy-on-write)或COW。基本的事情甚至与子字符串无关,但对于像

s1 = s2;

现在这样简单的事情,这种优化的问题是,对于应该在支持多线程的目标上使用的 C++ 库,字符串的引用计数必须是使用原子操作进行访问(或者更糟糕的是,使用互斥体进行保护,以防目标平台不提供原子操作)。这非常昂贵,因此在大多数情况下,简单的非 COW 字符串实现速度更快。

请参阅 GOTW #43-45:

http://www.gotw.ca/gotw/043.htm< /a>

http://www.gotw.ca/gotw/044.htm

< a href="http://www.gotw.ca/gotw/045.htm" rel="nofollow">http://www.gotw.ca/gotw/045.htm

更糟糕的是,使用 COW 的库(例如 GNU C++ 库)不能简单地恢复为简单实现,因为这会破坏 ABI。 (尽管如此,C++0x 可以拯救,因为无论如何这都需要 ABI 碰撞!:))

This is a quite well-known optimization that is relatively widely used, called copy-on-write or COW. The basic thing is not even to do with substrings, but with something as simple as

s1 = s2;

Now, the problem with this optimization is that for C++ libraries that are supposed to be used on targets supporting multiple threads, the reference count for the string has to be accessed using atomic operations (or worse, protected with a mutex in case the target platform doesn't supply atomic operations). This is expensive enough that in most cases the simple non-COW string implementation is faster.

See GOTW #43-45:

http://www.gotw.ca/gotw/043.htm

http://www.gotw.ca/gotw/044.htm

http://www.gotw.ca/gotw/045.htm

To make matters worse, libraries that have used COW, such as the GNU C++ library, cannot simply revert to the simple implementation since that would break the ABI. (Although, C++0x to the rescue, as that will require an ABI bump anyway! :) )

紧拥背影 2024-10-19 01:00:28

由于 substr 返回 std::string,因此无法返回代理对象,并且不能仅更改其返回类型或重载(原因如下)你提到过)。

他们可以通过使 string 本身能够成为另一个字符串的子字符串来实现这一点。这意味着所有用法都会受到内存损失(保存一个额外的字符串和两个 size_types)。此外,每个操作都需要检查它是否具有字符或者是代理。也许这可以通过实现指针来完成——问题是,现在我们正在使通用类在可能的边缘情况下变慢。

如果您需要这个,最好的方法是创建另一个类,substring,它由字符串、位置和长度构造,并转换为字符串。您不能将其用作 s1.substr(6),但您可以这样做

 substring sub(s1, 6);

您还需要创建采用子字符串和字符串的常见操作以避免转换(因为这就是重点) 。

Since substr returns std::string, there is no way to return a proxy object, and they can't just change the return type or overload on it (for the reasons you mentioned).

They could do this by making string itself capable of being a sub of another string. This would mean a memory penalty for all usages (to hold an extra string and two size_types). Also, every operation would need to check to see if it has the characters or is a proxy. Perhaps this could be done with an implementation pointer -- the problem is, now we're making a general purpose class slower for a possible edge case.

If you need this, the best way is to create another class, substring, that constructs from a string, pos, and length, and coverts to string. You can't use it as s1.substr(6), but you can do

 substring sub(s1, 6);

You would also need to create common operations that take a substring and string to avoid the conversion (since that's the whole point).

上课铃就是安魂曲 2024-10-19 01:00:28

关于您的具体示例,这对我有用:

if (&s1[6] == s2) {
    std::cout << "match!" << std::endl;
}

这可能无法回答您对通用解决方案的问题。为此,您需要子字符串 CoW,正如 @GMan 所建议的那样。

Regarding your specific example, this worked for me:

if (&s1[6] == s2) {
    std::cout << "match!" << std::endl;
}

That may not answer your question for a general-purpose solution. For that, you'd need sub-string CoW, as @GMan suggests.

幸福丶如此 2024-10-19 01:00:28

您所谈论的是(或曾经是)Java 的 java.lang.String 类的核心功能之一(http://fishbowl.pastiche.org/2005/04/27/the_string_memory_gotcha/)。在很多方面,Java 的 String 类和 C++ 的 basic_string 模板的设计是相似的,所以我想编写一个 basic_string 模板的实现利用这种“子串优化”是可能的。

您需要考虑的一件事是如何编写 c_str() const 成员的实现。根据一个字符串作为另一个字符串的子字符串的位置,它可能必须创建一个新的副本。如果请求 c_str 的字符串不是尾随子字符串,那么它肯定必须创建内部数组的新副本。我认为这需要在 basic_string 实现的大多数(如果不是全部)数据成员上使用 mutable 关键字,从而大大复杂化其他 const< /code> 方法,因为编译器不再能够帮助程序员确保 const 正确性。

编辑:实际上,为了容纳 c_str() constdata() const,您可以使用 const 类型的单个可变字段charT*。最初设置为 NULL,它可以是每个实例的,每当 c_str() const时,初始化为指向新 charT 数组的指针>data() const 被调用,如果非 NULL 则在 basic_string 析构函数中被删除。

What you are talking about is (or was) one of the core features of Java's java.lang.String class (http://fishbowl.pastiche.org/2005/04/27/the_string_memory_gotcha/). In many ways, the designs of Java's String class and C++'s basic_string template are similar, so I would imagine that writing an implementation of the basic_string template utilizing this "substring optimization" is possible.

One thing that you will need to consider is how to write the implementation of the c_str() const member. Depending on the location of a string as a substring of another, it may have to create a new copy. It definitely would have to create a new copy of the internal array if the string for which the c_str was requested is not a trailing substring. I think that this necessitates using the mutable keyword on most, if not all, of the data members of the basic_string implementation, greatly complicating the implementation of other const methods because the compiler is no longer able to assist the programmer with const correctness.

EDIT: Actually, to accommodate c_str() const and data() const, you could use a single mutable field of type const charT*. Initially set to NULL, it could be per-instance, initialized to a pointer to a new charT array whenever c_str() const or data() const are called, and deleted in the basic_string destructor if non-NULL.

ㄖ落Θ余辉 2024-10-19 01:00:28

当且仅当您确实需要比 std::string 提供的更多性能时,然后继续编写一些按您需要的方式工作的东西。我以前曾使用过字符串的变体。

我自己的偏好是使用不可变字符串而不是写时复制,并使用 boost::shared_ptr 或等效项,但仅当字符串实际长度超过 16 时,因此字符串类还有一个简短的私有缓冲区字符串。

这确实意味着字符串类可能有一点分量。

我的集合列表中还有一个“切片”类,只要原始对象的生命周期完好无损,它就可以查看位于其他地方的类的“子集”。因此,在您的情况下,我可以对字符串进行切片以查看子字符串。当然,它不会是空终止的,也没有任何方法可以在不复制它的情况下使其成为空终止。而且它不是字符串类。

If and only if you really need more performance than std::string provides then go ahead and write something that works the way you need it to. I have worked with variants of strings before.

My own preference is to use non-mutable strings rather than copy-on-write, and to use boost::shared_ptr or equivalent but only when the string is actually beyond 16 in length, so the string class also has a private buffer for short strings.

This does mean that the string class might carry a bit of weight.

I also have in my collection list a "slice" class that can look at a "subset" of a class that lives elsewhere as long as the lifetime of the original object is intact. So in your case I could slice the string to see a substring. Of course it would not be null-terminated, nor is there any way of making it such without copying it. And it is not a string class.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文