std::vector 比普通数组慢很多吗？

发布于 2024-09-18 03:49:46 字数 3098 浏览 3 评论 0原文

我一直认为 std::vector 是“作为数组实现”是普遍的看法，等等等等。今天我下去测试了一下，好像不是这样：

这是一些测试结果：

UseArray completed in 2.619 seconds
UseVector completed in 9.284 seconds
UseVectorPushBack completed in 14.669 seconds
The whole thing completed in 26.591 seconds

大约慢了3-4倍！并不能真正证明“向量可能会慢几纳秒”的评论。

我使用的代码：

#include <cstdlib>
#include <vector>

#include <iostream>
#include <string>

#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>

class TestTimer
{
    public:
        TestTimer(const std::string & name) : name(name),
            start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
        {
        }

        ~TestTimer()
        {
            using namespace std;
            using namespace boost;

            posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
            posix_time::time_duration d = now - start;

            cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
                " seconds" << endl;
        }

    private:
        std::string name;
        boost::posix_time::ptime start;
};

struct Pixel
{
    Pixel()
    {
    }

    Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
    {
    }

    unsigned char r, g, b;
};

void UseVector()
{
    TestTimer t("UseVector");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
        pixels.resize(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

void UseVectorPushBack()
{
    TestTimer t("UseVectorPushBack");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
            pixels.reserve(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
            pixels.push_back(Pixel(255, 0, 0));
    }
}

void UseArray()
{
    TestTimer t("UseArray");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);

        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }

        free(pixels);
    }
}

int main()
{
    TestTimer t1("The whole thing");

    UseArray();
    UseVector();
    UseVectorPushBack();

    return 0;
}

我做错了吗？或者我刚刚打破了这个性能神话？

我在 Visual Studio 2005 中使用发布模式。

在 Visual C++ 中，#define _SECURE_SCL 0 将 UseVector 减少一半（将其缩短至 4 秒）。在我看来，这确实是巨大的。

原文

I've always thought it's the general wisdom that std::vector is "implemented as an array," blah blah blah. Today I went down and tested it, and it seems to be not so:

Here's some test results:

UseArray completed in 2.619 seconds
UseVector completed in 9.284 seconds
UseVectorPushBack completed in 14.669 seconds
The whole thing completed in 26.591 seconds

That's about 3 - 4 times slower! Doesn't really justify for the "vector may be slower for a few nanosecs" comments.

And the code I used:

#include <cstdlib>
#include <vector>

#include <iostream>
#include <string>

#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>

class TestTimer
{
    public:
        TestTimer(const std::string & name) : name(name),
            start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
        {
        }

        ~TestTimer()
        {
            using namespace std;
            using namespace boost;

            posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
            posix_time::time_duration d = now - start;

            cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
                " seconds" << endl;
        }

    private:
        std::string name;
        boost::posix_time::ptime start;
};

struct Pixel
{
    Pixel()
    {
    }

    Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
    {
    }

    unsigned char r, g, b;
};

void UseVector()
{
    TestTimer t("UseVector");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
        pixels.resize(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

void UseVectorPushBack()
{
    TestTimer t("UseVectorPushBack");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
            pixels.reserve(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
            pixels.push_back(Pixel(255, 0, 0));
    }
}

void UseArray()
{
    TestTimer t("UseArray");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);

        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }

        free(pixels);
    }
}

int main()
{
    TestTimer t1("The whole thing");

    UseArray();
    UseVector();
    UseVectorPushBack();

    return 0;
}

Am I doing it wrong or something? Or have I just busted this performance myth?

I'm using Release mode in Visual Studio 2005.

In Visual C++, #define _SECURE_SCL 0 reduces UseVector by half (bringing it down to 4 seconds). This is really huge, IMO.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

十年不长 2024-09-25 03:49:46

使用以下内容：

g++ -O3 Time.cpp -I
./a.out
UseArray 在 2.196 秒内完成
UseVector 在 4.412 秒内完成
UseVectorPushBack 在 8.017 秒内完成
整个过程在 14.626 秒内完成

，因此数组的速度是向量的两倍。

但是在更详细地查看代码之后，这是预期的；当你遇到向量两次而数组只运行一次时。注意：当您resize()向量时，您不仅分配内存，而且还运行向量并调用每个成员的构造函数。

稍微重新安排代码，以便向量仅初始化每个对象一次：

 std::vector<Pixel>  pixels(dimensions * dimensions, Pixel(255,0,0));

现在再次执行相同的计时：

g++ -O3 Time.cpp -I
./a.out
UseVector 在 2.216 秒内完成

向量现在的性能仅比数组稍差。在我看来，这种差异是微不足道的，可能是由一大堆与测试无关的事情引起的。

我还要考虑到您没有在 UseArrray() 方法中正确初始化/销毁 Pixel 对象，因为构造函数/析构函数都没有被调用（对于这个简单的类来说这可能不是问题，但是任何稍微复杂的东西（即使用指针或带有指针的成员）都会导致问题。

Using the following:

g++ -O3 Time.cpp -I <MyBoost>
./a.out
UseArray completed in 2.196 seconds
UseVector completed in 4.412 seconds
UseVectorPushBack completed in 8.017 seconds
The whole thing completed in 14.626 seconds

So array is twice as quick as vector.

But after looking at the code in more detail this is expected; as you run across the vector twice and the array only once. Note: when you resize() the vector you are not only allocating the memory but also running through the vector and calling the constructor on each member.

Re-Arranging the code slightly so that the vector only initializes each object once:

 std::vector<Pixel>  pixels(dimensions * dimensions, Pixel(255,0,0));

Now doing the same timing again:

g++ -O3 Time.cpp -I <MyBoost>
./a.out
UseVector completed in 2.216 seconds

The vector now performance only slightly worse than the array. IMO this difference is insignificant and could be caused by a whole bunch of things not associated with the test.

I would also take into account that you are not correctly initializing/Destroying the Pixel object in the UseArrray() method as neither constructor/destructor is not called (this may not be an issue for this simple class but anything slightly more complex (ie with pointers or members with pointers) will cause problems.

回复收藏 0 原文

瑕疵 2024-09-25 03:49:46

很好的问题。我来到这里希望找到一些简单的修复方法来加快矢量测试的速度。结果并不像我预期的那样！

优化有帮助，但这还不够。通过优化，我仍然发现 UseArray 和 UseVector 之间有 2 倍的性能差异。有趣的是，在没有优化的情况下，UseVector 明显比 UseVectorPushBack 慢。

# g++ -Wall -Wextra -pedantic -o vector vector.cpp
# ./vector
UseArray completed in 20.68 seconds
UseVector completed in 120.509 seconds
UseVectorPushBack completed in 37.654 seconds
The whole thing completed in 178.845 seconds
# g++ -Wall -Wextra -pedantic -O3 -o vector vector.cpp
# ./vector
UseArray completed in 3.09 seconds
UseVector completed in 6.09 seconds
UseVectorPushBack completed in 9.847 seconds
The whole thing completed in 19.028 seconds

想法 #1 - 使用 new[] 而不是 malloc

我尝试将 UseArray 中的 malloc() 更改为 new[] ，以便构造对象。并从单独的字段分配更改为分配 Pixel 实例。哦，并将内部循环变量重命名为j。

void UseArray()
{
    TestTimer t("UseArray");

    for(int i = 0; i < 1000; ++i)
    {   
        int dimension = 999;

        // Same speed as malloc().
        Pixel * pixels = new Pixel[dimension * dimension];

        for(int j = 0 ; j < dimension * dimension; ++j)
            pixels[j] = Pixel(255, 0, 0);

        delete[] pixels;
    }
}

令人惊讶的是（对我来说），这些变化都没有产生任何影响。甚至不更改默认构造所有像素的 new[] 。看起来 gcc 可以在使用 new[] 时优化默认构造函数调用，但在使用 vector 时则不能。

想法#2 - 删除重复的operator[]调用

我还尝试摆脱三重operator[]查找并缓存对pixels[j]的引用。这实际上减慢了 UseVector 的速度！哎呀。

for(int j = 0; j < dimension * dimension; ++j)
{
    // Slower than accessing pixels[j] three times.
    Pixel &pixel = pixels[j];
    pixel.r = 255;
    pixel.g = 0;
    pixel.b = 0;
}

# ./vector 
UseArray completed in 3.226 seconds
UseVector completed in 7.54 seconds
UseVectorPushBack completed in 9.859 seconds
The whole thing completed in 20.626 seconds

想法 #3 - 删除构造函数

完全删除构造函数怎么样？那么也许 gcc 可以在创建向量时优化所有对象的构造。如果我们将 Pixel 更改为：

struct Pixel
{
    unsigned char r, g, b;
};

结果：速度提高约 10%。还是比数组慢。嗯。

# ./vector 
UseArray completed in 3.239 seconds
UseVector completed in 5.567 seconds

想法 #4 - 使用迭代器而不是循环索引

使用 vector::iterator 而不是循环索引怎么样？

for (std::vector<Pixel>::iterator j = pixels.begin(); j != pixels.end(); ++j)
{
    j->r = 255;
    j->g = 0;
    j->b = 0;
}

结果：

# ./vector 
UseArray completed in 3.264 seconds
UseVector completed in 5.443 seconds

不，没有什么不同。至少速度不慢。我认为这会具有类似于 #2 的性能，其中我使用了 Pixel& 参考。

结论

即使某些聪明的 cookie 弄清楚了如何使向量循环与数组循环一样快，但这并不能很好地说明 std::vector 的默认行为。编译器足够聪明，可以优化所有 C++ 特性并使 STL 容器与原始数组一样快。

最重要的是，在使用 std::vector 时，编译器无法优化无操作默认构造函数调用。如果您使用普通的 new[] 它可以很好地优化它们。但对于 std::vector 则不然。即使您可以重写代码以消除与这里的口头禅相反的构造函数调用：“编译器比您更聪明。STL 与普通 C 一样快。别担心。”

Great question. I came in here expecting to find some simple fix that would speed the vector tests right up. That didn't work out quite like I expected!

Optimization helps, but it's not enough. With optimization on I'm still seeing a 2X performance difference between UseArray and UseVector. Interestingly, UseVector was significantly slower than UseVectorPushBack without optimization.

# g++ -Wall -Wextra -pedantic -o vector vector.cpp
# ./vector
UseArray completed in 20.68 seconds
UseVector completed in 120.509 seconds
UseVectorPushBack completed in 37.654 seconds
The whole thing completed in 178.845 seconds
# g++ -Wall -Wextra -pedantic -O3 -o vector vector.cpp
# ./vector
UseArray completed in 3.09 seconds
UseVector completed in 6.09 seconds
UseVectorPushBack completed in 9.847 seconds
The whole thing completed in 19.028 seconds

Idea #1 - Use new[] instead of malloc

I tried changing malloc() to new[] in UseArray so the objects would get constructed. And changing from individual field assignment to assigning a Pixel instance. Oh, and renaming the inner loop variable to j.

void UseArray()
{
    TestTimer t("UseArray");

    for(int i = 0; i < 1000; ++i)
    {   
        int dimension = 999;

        // Same speed as malloc().
        Pixel * pixels = new Pixel[dimension * dimension];

        for(int j = 0 ; j < dimension * dimension; ++j)
            pixels[j] = Pixel(255, 0, 0);

        delete[] pixels;
    }
}

Surprisingly (to me), none of those changes made any difference whatsoever. Not even the change to new[] which will default construct all of the Pixels. It seems that gcc can optimize out the default constructor calls when using new[], but not when using vector.

Idea #2 - Remove repeated operator[] calls

I also attempted to get rid of the triple operator[] lookup and cache the reference to pixels[j]. That actually slowed UseVector down! Oops.

for(int j = 0; j < dimension * dimension; ++j)
{
    // Slower than accessing pixels[j] three times.
    Pixel &pixel = pixels[j];
    pixel.r = 255;
    pixel.g = 0;
    pixel.b = 0;
}

# ./vector 
UseArray completed in 3.226 seconds
UseVector completed in 7.54 seconds
UseVectorPushBack completed in 9.859 seconds
The whole thing completed in 20.626 seconds

Idea #3 - Remove constructors

What about removing the constructors entirely? Then perhaps gcc can optimize out the construction of all of the objects when the vectors are created. What happens if we change Pixel to:

struct Pixel
{
    unsigned char r, g, b;
};

Result: about 10% faster. Still slower than an array. Hm.

# ./vector 
UseArray completed in 3.239 seconds
UseVector completed in 5.567 seconds

Idea #4 - Use iterator instead of loop index

How about using a vector<Pixel>::iterator instead of a loop index?

for (std::vector<Pixel>::iterator j = pixels.begin(); j != pixels.end(); ++j)
{
    j->r = 255;
    j->g = 0;
    j->b = 0;
}

Result:

# ./vector 
UseArray completed in 3.264 seconds
UseVector completed in 5.443 seconds

Nope, no different. At least it's not slower. I thought this would have performance similar to #2 where I used a Pixel& reference.

Conclusion

Even if some smart cookie figures out how to make the vector loop as fast as the array one, this does not speak well of the default behavior of std::vector. So much for the compiler being smart enough to optimize out all the C++ness and make STL containers as fast as raw arrays.

The bottom line is that the compiler is unable to optimize away the no-op default constructor calls when using std::vector. If you use plain new[] it optimizes them away just fine. But not with std::vector. Even if you can rewrite your code to eliminate the constructor calls that flies in face of the mantra around here: "The compiler is smarter than you. The STL is just as fast as plain C. Don't worry about it."

回复收藏 0 原文

对你而言 2024-09-25 03:49:46

这是一个古老但流行的问题。

此时，许多程序员将使用 C++11 工作。在 C++11 中，OP 的编写代码对于 UseArray 或 UseVector 运行速度同样快。

UseVector completed in 3.74482 seconds
UseArray completed in 3.70414 seconds

根本问题是，当您的 Pixel 结构未初始化时，std::vector::resize( size_t, T const&=T() ) 采用默认值构造Pixel并复制它。编译器没有注意到它被要求复制未初始化的数据，因此它实际上执行了复制。

在 C++11 中，std::vector::resize 有两个重载。第一个是 std::vector::resize(size_t)，另一个是 std::vector::resize(size_t, T const&) >。这意味着当您在没有第二个参数的情况下调用 resize 时，它只是默认构造，并且编译器足够聪明，可以意识到默认构造不执行任何操作，因此它会跳过缓冲区的传递。

（添加两个重载是为了处理可移动、可构造和不可复制类型——处理未初始化数据时的性能改进是一个额外的好处）。

push_back 解决方案还进行了 fencepost 检查，这会减慢速度，因此它仍然比 malloc 版本慢。

现场示例（我还用 chrono::high_resolution_clock 替换了计时器）。

请注意，如果您有一个通常需要初始化的结构，但您想在增加缓冲区后处理它，则可以使用自定义 std::vector 分配器来完成此操作。如果您想将其移动到更正常的 std::vector 中，我相信仔细使用 allocator_traits 并覆盖 == 可能会导致关闭，但我不确定。

This is an old but popular question.

At this point, many programmers will be working in C++11. And in C++11 the OP's code as written runs equally fast for UseArray or UseVector.

UseVector completed in 3.74482 seconds
UseArray completed in 3.70414 seconds

The fundamental problem was that while your Pixel structure was uninitialized, std::vector<T>::resize( size_t, T const&=T() ) takes a default constructed Pixel and copies it. The compiler did not notice it was being asked to copy uninitialized data, so it actually performed the copy.

In C++11, std::vector<T>::resize has two overloads. The first is std::vector<T>::resize(size_t), the other is std::vector<T>::resize(size_t, T const&). This means when you invoke resize without a second argument, it simply default constructs, and the compiler is smart enough to realize that default construction does nothing, so it skips the pass over the buffer.

(The two overloads where added to handle movable, constructable and non-copyable types -- the performance improvement when working on uninitialized data is a bonus).

The push_back solution also does fencepost checking, which slows it down, so it remains slower than the malloc version.

live example (I also replaced the timer with chrono::high_resolution_clock).

Note that if you have a structure that usually requires initialization, but you want to handle it after growing your buffer, you can do this with a custom std::vector allocator. If you want to then move it into a more normal std::vector, I believe careful use of allocator_traits and overriding of == might pull that off, but am unsure.

回复收藏 0 原文

蓝海似她心 2024-09-25 03:49:46

公平地说，您无法将 C++ 实现与 C 实现进行比较，因为我将其称为 malloc 版本。 malloc 不创建对象 - 它只分配原始内存。然后将内存视为对象而不调用构造函数是糟糕的 C++（可能无效 - 我将把它留给语言律师）。

也就是说，简单地将 malloc 更改为 new Pixel[dimensions*dimensions] 并 free 为 delete [] Pixel 与您拥有的 Pixel 的简单实现没有太大区别。以下是我的机器（E6600，64 位）上的结果：

UseArray completed in 0.269 seconds
UseVector completed in 1.665 seconds
UseVectorPushBack completed in 7.309 seconds
The whole thing completed in 9.244 seconds

但稍作更改，表格变为：

Pixel.h

struct Pixel
{
    Pixel();
    Pixel(unsigned char r, unsigned char g, unsigned char b);

    unsigned char r, g, b;
};

Pixel.cc

#include "Pixel.h"

Pixel::Pixel() {}
Pixel::Pixel(unsigned char r, unsigned char g, unsigned char b) 
  : r(r), g(g), b(b) {}

main.cc

#include "Pixel.h"
[rest of test harness without class Pixel]
[UseArray now uses new/delete not malloc/free]

以这种方式编译：

$ g++ -O3 -c -o Pixel.o Pixel.cc
$ g++ -O3 -c -o main.o main.cc
$ g++ -o main main.o Pixel.o

我们得到非常不同的结果：

UseArray completed in 2.78 seconds
UseVector completed in 1.651 seconds
UseVectorPushBack completed in 7.826 seconds
The whole thing completed in 12.258 seconds

使用非内联构造函数Pixel、std::vector 现在胜过原始数组。

看来通过 std::vector 和 std:allocator 进行分配的复杂性太大，无法像简单的 new Pixel[n] 那样有效地进行优化。然而，我们可以看到，问题只是在于分配，而不是向量访问，通过调整几个测试函数，通过将其移出循环来创建向量/数组一次：

void UseVector()
{
    TestTimer t("UseVector");

    int dimension = 999;
    std::vector<Pixel> pixels;
    pixels.resize(dimension * dimension);

    for(int i = 0; i < 1000; ++i)
    {
        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

我们

void UseArray()
{
    TestTimer t("UseArray");

    int dimension = 999;
    Pixel * pixels = new Pixel[dimension * dimension];

    for(int i = 0; i < 1000; ++i)
    {
        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
    delete [] pixels;
}

现在得到这些结果：

UseArray completed in 0.254 seconds
UseVector completed in 0.249 seconds
UseVectorPushBack completed in 7.298 seconds
The whole thing completed in 7.802 seconds

我们可以学到什么由此可见， std::vector 与访问原始数组相当，但如果需要多次创建和删除向量/数组，则在元素的构造函数中创建复杂对象将比创建简单数组更耗时不是内联的。我认为这并不令人惊讶。

To be fair, you cannot compare a C++ implementation to a C implementation, as I would call your malloc version. malloc does not create objects - it only allocates raw memory. That you then treat that memory as objects without calling the constructor is poor C++ (possibly invalid - I'll leave that to the language lawyers).

That said, simply changing the malloc to new Pixel[dimensions*dimensions] and free to delete [] pixels does not make much difference with the simple implementation of Pixel that you have. Here's the results on my box (E6600, 64-bit):

UseArray completed in 0.269 seconds
UseVector completed in 1.665 seconds
UseVectorPushBack completed in 7.309 seconds
The whole thing completed in 9.244 seconds

But with a slight change, the tables turn:

Pixel.h

struct Pixel
{
    Pixel();
    Pixel(unsigned char r, unsigned char g, unsigned char b);

    unsigned char r, g, b;
};

Pixel.cc

#include "Pixel.h"

Pixel::Pixel() {}
Pixel::Pixel(unsigned char r, unsigned char g, unsigned char b) 
  : r(r), g(g), b(b) {}

main.cc

#include "Pixel.h"
[rest of test harness without class Pixel]
[UseArray now uses new/delete not malloc/free]

Compiled this way:

$ g++ -O3 -c -o Pixel.o Pixel.cc
$ g++ -O3 -c -o main.o main.cc
$ g++ -o main main.o Pixel.o

we get very different results:

UseArray completed in 2.78 seconds
UseVector completed in 1.651 seconds
UseVectorPushBack completed in 7.826 seconds
The whole thing completed in 12.258 seconds

With a non-inlined constructor for Pixel, std::vector now beats a raw array.

It would appear that the complexity of allocation through std::vector and std:allocator is too much to be optimised as effectively as a simple new Pixel[n]. However, we can see that the problem is simply with the allocation not the vector access by tweaking a couple of the test functions to create the vector/array once by moving it outside the loop:

void UseVector()
{
    TestTimer t("UseVector");

    int dimension = 999;
    std::vector<Pixel> pixels;
    pixels.resize(dimension * dimension);

    for(int i = 0; i < 1000; ++i)
    {
        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

and

void UseArray()
{
    TestTimer t("UseArray");

    int dimension = 999;
    Pixel * pixels = new Pixel[dimension * dimension];

    for(int i = 0; i < 1000; ++i)
    {
        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
    delete [] pixels;
}

We get these results now:

UseArray completed in 0.254 seconds
UseVector completed in 0.249 seconds
UseVectorPushBack completed in 7.298 seconds
The whole thing completed in 7.802 seconds

What we can learn from this is that std::vector is comparable to a raw array for access, but if you need to create and delete the vector/array many times, creating a complex object will be more time consuming that creating a simple array when the element's constructor is not inlined. I don't think that this is very surprising.

回复收藏 0 原文

若沐 2024-09-25 03:49:46

当我第一次看到你的代码时，这很难说是一个公平的比较；我绝对认为你不是在比较苹果与苹果。所以我想，让我们在所有测试中调用构造函数和析构函数；然后进行比较。

const size_t dimension = 1000;

void UseArray() {
    TestTimer t("UseArray");
    for(size_t j = 0; j < dimension; ++j) {
        Pixel* pixels = new Pixel[dimension * dimension];
        for(size_t i = 0 ; i < dimension * dimension; ++i) {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = (unsigned char) (i % 255);
        }
        delete[] pixels;
    }
}

void UseVector() {
    TestTimer t("UseVector");
    for(size_t j = 0; j < dimension; ++j) {
        std::vector<Pixel> pixels(dimension * dimension);
        for(size_t i = 0; i < dimension * dimension; ++i) {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = (unsigned char) (i % 255);
        }
    }
}

int main() {
    TestTimer t1("The whole thing");

    UseArray();
    UseVector();

    return 0;
}

我的想法是，通过这种设置，它们应该完全相同。事实证明，我错了。

UseArray completed in 3.06 seconds
UseVector completed in 4.087 seconds
The whole thing completed in 10.14 seconds

那么为什么会出现 30% 的性能损失呢？ STL 在标头中包含所有内容，因此编译器应该能够理解所需的所有内容。

我的想法是，循环如何将所有值初始化为默认构造函数。所以我进行了一个测试：

class Tester {
public:
    static int count;
    static int count2;
    Tester() { count++; }
    Tester(const Tester&) { count2++; }
};
int Tester::count = 0;
int Tester::count2 = 0;

int main() {
    std::vector<Tester> myvec(300);
    printf("Default Constructed: %i\nCopy Constructed: %i\n", Tester::count, Tester::count2);

    return 0;
}

结果正如我所怀疑的那样：

Default Constructed: 1
Copy Constructed: 300

这显然是速度减慢的根源，事实上向量使用复制构造函数来初始化默认构造对象中的元素。

这意味着，在向量的构造过程中会发生以下伪操作顺序：

Pixel pixel;
for (auto i = 0; i < N; ++i) vector[i] = pixel;

由于编译器创建的隐式复制构造函数，该伪操作顺序被扩展为以下内容：

Pixel pixel;
for (auto i = 0; i < N; ++i) {
    vector[i].r = pixel.r;
    vector[i].g = pixel.g;
    vector[i].b = pixel.b;
}

因此默认的 Pixel 仍然是 un - 已初始化，而其余部分则使用默认 Pixel 的未初始化值进行初始化。

与 New[]/Delete[] 的替代情况相比：

int main() {
    Tester* myvec = new Tester[300];

    printf("Default Constructed: %i\nCopy Constructed:%i\n", Tester::count, Tester::count2);

    delete[] myvec;

    return 0;
}

Default Constructed: 300
Copy Constructed: 0

它们都保留为未初始化的值，并且没有对序列进行两次迭代。

有了这些信息，我们如何测试它呢？让我们尝试重写隐式复制构造函数。

Pixel(const Pixel&) {}

结果如何？

UseArray completed in 2.617 seconds
UseVector completed in 2.682 seconds
The whole thing completed in 5.301 seconds

总之，如果您经常创建数百个向量：重新考虑您的算法。

无论如何，STL 实现并不会因为某些未知原因而变慢，它只是完全按照您的要求进行操作；希望你了解得更多。

It was hardly a fair comparison when I first looked at your code; I definitely thought you weren't comparing apples with apples. So I thought, let's get constructors and destructors being called on all tests; and then compare.

const size_t dimension = 1000;

void UseArray() {
    TestTimer t("UseArray");
    for(size_t j = 0; j < dimension; ++j) {
        Pixel* pixels = new Pixel[dimension * dimension];
        for(size_t i = 0 ; i < dimension * dimension; ++i) {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = (unsigned char) (i % 255);
        }
        delete[] pixels;
    }
}

void UseVector() {
    TestTimer t("UseVector");
    for(size_t j = 0; j < dimension; ++j) {
        std::vector<Pixel> pixels(dimension * dimension);
        for(size_t i = 0; i < dimension * dimension; ++i) {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = (unsigned char) (i % 255);
        }
    }
}

int main() {
    TestTimer t1("The whole thing");

    UseArray();
    UseVector();

    return 0;
}

My thoughts were, that with this setup, they should be exactly the same. It turns out, I was wrong.

UseArray completed in 3.06 seconds
UseVector completed in 4.087 seconds
The whole thing completed in 10.14 seconds

So why did this 30% performance loss even occur? The STL has everything in headers, so it should have been possible for the compiler to understand everything that was required.

My thoughts were that it is in how the loop initialises all values to the default constructor. So I performed a test:

class Tester {
public:
    static int count;
    static int count2;
    Tester() { count++; }
    Tester(const Tester&) { count2++; }
};
int Tester::count = 0;
int Tester::count2 = 0;

int main() {
    std::vector<Tester> myvec(300);
    printf("Default Constructed: %i\nCopy Constructed: %i\n", Tester::count, Tester::count2);

    return 0;
}

The results were as I suspected:

Default Constructed: 1
Copy Constructed: 300

This is clearly the source of the slowdown, the fact that the vector uses the copy constructor to initialise the elements from a default constructed object.

This means, that the following pseudo-operation order is happening during construction of the vector:

Pixel pixel;
for (auto i = 0; i < N; ++i) vector[i] = pixel;

Which, due to the implicit copy constructor made by the compiler, is expanded to the following:

Pixel pixel;
for (auto i = 0; i < N; ++i) {
    vector[i].r = pixel.r;
    vector[i].g = pixel.g;
    vector[i].b = pixel.b;
}

So the default Pixel remains un-initialised, while the rest are initialised with the default Pixel's un-initialised values.

Compared to the alternative situation with New[]/Delete[]:

int main() {
    Tester* myvec = new Tester[300];

    printf("Default Constructed: %i\nCopy Constructed:%i\n", Tester::count, Tester::count2);

    delete[] myvec;

    return 0;
}

Default Constructed: 300
Copy Constructed: 0

They are all left to their un-initialised values, and without the double iteration over the sequence.

Armed with this information, how can we test it? Let's try over-writing the implicit copy constructor.

Pixel(const Pixel&) {}

And the results?

UseArray completed in 2.617 seconds
UseVector completed in 2.682 seconds
The whole thing completed in 5.301 seconds

So in summary, if you're making hundreds of vectors very often: re-think your algorithm.

In any case, the STL implementation isn't slower for some unknown reason, it just does exactly what you ask; hoping you know better.

回复收藏 0 原文

很糊涂小朋友 2024-09-25 03:49:46

尝试一下：

void UseVectorCtor()
{
    TestTimer t("UseConstructor");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));
    }
}

我获得了与数组几乎完全相同的性能。

vector 的特点是它是一种比数组更通用的工具。这意味着您必须考虑如何使用它。它可以以多种不同的方式使用，提供数组甚至没有的功能。如果你“错误”地使用它来达到你的目的，你会产生大量的开销，但如果你正确地使用它，它通常基本上是一个零开销的数据结构。在这种情况下，问题在于您单独初始化了向量（导致所有元素调用其默认构造函数），然后用正确的值单独覆盖每个元素。对于编译器来说，这比对数组执行相同的操作要困难得多。这就是为什么向量提供了一个构造函数，可以让您准确地执行此操作：用值 X 初始化 N 元素。

当你使用它时，向量和数组一样快。

所以不，你还没有打破性能神话。但你已经证明，只有当你以最佳方式使用向量时，这才是正确的，这也是一个很好的观点。 :)

从好的方面来看，这确实是最简单的用法，但结果却是最快的。如果您将我的代码片段（一行）与 John Kugelman 的答案进行对比，其中包含大量的调整和优化，但仍然不能完全消除性能差异，那么很明显 vector 相当不错毕竟设计得很巧妙。您不必费力就能获得与阵列相同的速度。相反，您必须使用最简单的解决方案。

Try with this:

void UseVectorCtor()
{
    TestTimer t("UseConstructor");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));
    }
}

I get almost exactly the same performance as with array.

The thing about vector is that it's a much more general tool than an array. And that means you have to consider how you use it. It can be used in a lot of different ways, providing functionality that an array doesn't even have. And if you use it "wrong" for your purpose, you incur a lot of overhead, but if you use it correctly, it is usually basically a zero-overhead data structure. In this case, the problem is that you separately initialized the vector (causing all elements to have their default ctor called), and then overwriting each element individually with the correct value. That is much harder for the compiler to optimize away than when you do the same thing with an array. Which is why the vector provides a constructor which lets you do exactly that: initialize N elements with value X.

And when you use that, the vector is just as fast as an array.

So no, you haven't busted the performance myth. But you have shown that it's only true if you use the vector optimally, which is a pretty good point too. :)

On the bright side, it's really the simplest usage that turns out to be fastest. If you contrast my code snippet (a single line) with John Kugelman's answer, containing heaps and heaps of tweaks and optimizations, which still don't quite eliminate the performance difference, it's pretty clear that vector is pretty cleverly designed after all. You don't have to jump through hoops to get speed equal to an array. On the contrary, you have to use the simplest possible solution.

回复收藏 0 原文

浮萍、无处依 2024-09-25 03:49:46

尝试禁用检查的迭代器并在发布模式下构建。您不应该看到太大的性能差异。

回复收藏 0 原文

﹏雨一样淡蓝的深情 2024-09-25 03:49:46

GNU 的 STL（和其他），给定 vector(n)，默认构造一个原型对象 T() - 编译器将优化掉空构造函数 - 但随后现在为该对象保留的内存地址中发生的任何垃圾的副本都会由 STL 的 __uninitialized_fill_n_aux 获取，它会循环填充该对象的副本作为向量中的默认值。所以，“我的”STL不是循环构造，而是构造然后循环/复制。这是违反直觉的，但我应该记得，因为我在最近的一个 stackoverflow 问题上评论了这一点：构造/复制对于引用计数对象等来说可以更有效。

所以：

vector<T> x(n);

vector<T> x;
x.resize(n);

is - 在许多 STL 实现上 - 类似于：

T temp;
for (int i = 0; i < n; ++i)
    x[i] = temp;

问题在于，当前一代的编译器优化器似乎无法从 temp 是未初始化的垃圾的角度出发，并且无法优化循环和默认复制构造函数调用。您可以可信地争辩说，编译器绝对不应该对此进行优化，因为编写上述代码的程序员有一个合理的期望，即所有对象在循环后都将相同，即使是垃圾（关于“相同”/operator== 的通常警告） memcmp/operator= 等适用）。不能期望编译器对 std::vector<> 的更大上下文有任何额外的了解。或稍后使用的数据表明此优化是安全的。

这可以与更明显的直接实现形成对比：

for (int i = 0; i < n; ++i)
    x[i] = T();

我们可以期望编译器对其进行优化。

为了更明确地说明向量行为这方面的理由，请考虑：

std::vector<big_reference_counted_object> x(10000);

显然，如果我们创建 10000 个独立对象与 10000 个引用相同数据的对象，这是一个主要区别。有一个合理的论点认为，保护普通 C++ 用户免于意外地执行如此昂贵的操作的优势超过了难以优化的复制构造在现实世界中所付出的非常小的成本。

原始答案（供参考/理解评论）：
没有机会。向量与数组一样快，至少如果您明智地保留空间的话。 ...

GNU's STL (and others), given vector<T>(n), default constructs a prototypal object T() - the compiler will optimise away the empty constructor - but then a copy of whatever garbage happened to be in the memory addresses now reserved for the object is taken by the STL's __uninitialized_fill_n_aux, which loops populating copies of that object as the default values in the vector. So, "my" STL is not looping constructing, but constructing then loop/copying. It's counter intuitive, but I should have remembered as I commented on a recent stackoverflow question about this very point: the construct/copy can be more efficient for reference counted objects etc..

So:

vector<T> x(n);

vector<T> x;
x.resize(n);

is - on many STL implementations - something like:

T temp;
for (int i = 0; i < n; ++i)
    x[i] = temp;

The issue being that the current generation of compiler optimisers don't seem to work from the insight that temp is uninitialised garbage, and fail to optimise out the loop and default copy constructor invocations. You could credibly argue that compilers absolutely shouldn't optimise this away, as a programmer writing the above has a reasonable expectation that all the objects will be identical after the loop, even if garbage (usual caveats about 'identical'/operator== vs memcmp/operator= etc apply). The compiler can't be expected to have any extra insight into the larger context of std::vector<> or the later usage of the data that would suggest this optimisation safe.

This can be contrasted with the more obvious, direct implementation:

for (int i = 0; i < n; ++i)
    x[i] = T();

Which we can expect a compiler to optimise out.

To be a bit more explicit about the justification for this aspect of vector's behaviour, consider:

std::vector<big_reference_counted_object> x(10000);

Clearly it's a major difference if we make 10000 independent objects versus 10000 referencing the same data. There's a reasonable argument that the advantage of protecting casual C++ users from accidentally doing something so expensive outweights the very small real-world cost of hard-to-optimise copy construction.

ORIGINAL ANSWER (for reference / making sense of the comments):
No chance. vector is as fast as an array, at least if you reserve space sensibly. ...

回复收藏 0 原文

我的黑色迷你裙 2024-09-25 03:49:46

马丁·约克的回答让我烦恼，因为它似乎就像试图掩盖初始化问题一样。但他认为冗余默认构造是性能问题的根源是正确的。

[编辑：Martin 的回答不再建议更改默认构造函数。]

对于眼前的问题，您当然可以调用 vector 的 2 参数版本相反：

std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));

如果您想用常量值进行初始化（这是常见情况），那么这很有效。但更普遍的问题是：如何使用比常量值更复杂的东西有效地初始化？

为此，您可以使用 back_insert_iterator，它是一个迭代器适配器。这是一个带有 int 向量的示例，尽管总体思路对于 Pixel 也适用：

#include <iterator>
// Simple functor return a list of squares: 1, 4, 9, 16...
struct squares {
    squares() { i = 0; }
    int operator()() const { ++i; return i * i; }

private:
    int i;
};

...

std::vector<int> v;
v.reserve(someSize);     // To make insertions efficient
std::generate_n(std::back_inserter(v), someSize, squares());

或者您可以使用 copy()或 transform() 而不是 generate_n()。

缺点是构造初始值的逻辑需要移到一个单独的类中，这比直接放置它不太方便（尽管 C++1x 中的 lambda 使这变得更好）。另外，我预计这仍然不会像基于 malloc() 的非 STL 版本那么快，但我预计它会接近，因为它只为每个元素执行一次构造。

Martin York's answer bothers me because it seems like an attempt to brush the initialisation problem under the carpet. But he is right to identify redundant default construction as the source of performance problems.

[EDIT: Martin's answer no longer suggests changing the default constructor.]

For the immediate problem at hand, you could certainly call the 2-parameter version of the vector<Pixel> ctor instead:

std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));

That works if you want to initialise with a constant value, which is a common case. But the more general problem is: How can you efficiently initialise with something more complicated than a constant value?

For this you can use a back_insert_iterator, which is an iterator adaptor. Here's an example with a vector of ints, although the general idea works just as well for Pixels:

#include <iterator>
// Simple functor return a list of squares: 1, 4, 9, 16...
struct squares {
    squares() { i = 0; }
    int operator()() const { ++i; return i * i; }

private:
    int i;
};

...

std::vector<int> v;
v.reserve(someSize);     // To make insertions efficient
std::generate_n(std::back_inserter(v), someSize, squares());

Alternatively you could use copy() or transform() instead of generate_n().

The downside is that the logic to construct the initial values needs to be moved into a separate class, which is less convenient than having it in-place (although lambdas in C++1x make this much nicer). Also I expect this will still not be as fast as a malloc()-based non-STL version, but I expect it will be close, since it only does one construction for each element.

回复收藏 0 原文

指尖凝香 2024-09-25 03:49:46

矢量还调用 Pixel 构造函数。

每个都会导致您正在计时的近一百万次 ctor 运行。

编辑：然后是外部 1...1000 循环，因此请进行 10 亿次 ctor 调用！

编辑2：看看 UseArray 案例的反汇编会很有趣。优化器可以优化整个事情，因为它除了烧毁 CPU 之外没有任何效果。

回复收藏 0 原文

温柔嚣张 2024-09-25 03:49:46

以下是向量中的 push_back 方法的工作原理：

向量在初始化时分配 X 空间量。
如下所述，它检查当前底层数组中是否有空间容纳该项目。
它在push_back调用中创建了该项目的副本。

调用 push_back X 个项目后：

向量将 kX 空间量重新分配到第二个数组中。
它将第一个数组的条目复制到第二个数组中。
丢弃第一个数组。
现在使用第二个数组作为存储，直到它达到 kX 个条目。

重复。如果您不预留空间，它肯定会变慢。更重要的是，如果复制该项目的成本很高，那么像这样的“push_back”就会把你活活吃掉。

至于向量与数组的问题，我必须同意其他人的观点。在发布版本中运行，打开优化，并添加更多标志，这样微软的友好人员就不会#@%$^ 为你做这些。

还有一件事，如果您不需要调整大小，请使用 Boost.Array。

回复收藏 0 原文

讽刺将军 2024-09-25 03:49:46

一些分析器数据（像素与 32 位对齐）：

g++ -msse3 -O3 -ftree-vectorize -g test.cpp -DNDEBUG && ./a.out
UseVector completed in 3.123 seconds
UseArray completed in 1.847 seconds
UseVectorPushBack completed in 9.186 seconds
The whole thing completed in 14.159 seconds

Blah

andrey@nv:~$ opannotate --source libcchem/src/a.out  | grep "Total samples for file" -A3
Overflow stats not available
 * Total samples for file : "/usr/include/c++/4.4/ext/new_allocator.h"
 *
 * 141008 52.5367
 */
--
 * Total samples for file : "/home/andrey/libcchem/src/test.cpp"
 *
 *  61556 22.9345
 */
--
 * Total samples for file : "/usr/include/c++/4.4/bits/stl_vector.h"
 *
 *  41956 15.6320
 */
--
 * Total samples for file : "/usr/include/c++/4.4/bits/stl_uninitialized.h"
 *
 *  20956  7.8078
 */
--
 * Total samples for file : "/usr/include/c++/4.4/bits/stl_construct.h"
 *
 *   2923  1.0891
 */

In allocator:

               :      // _GLIBCXX_RESOLVE_LIB_DEFECTS
               :      // 402. wrong new expression in [some_] allocator::construct
               :      void
               :      construct(pointer __p, const _Tp& __val)
141008 52.5367 :      { ::new((void *)__p) _Tp(__val); }

vector:

               :void UseVector()
               :{ /* UseVector() total:  60121 22.3999 */
...
               :
               :
 10790  4.0201 :        for (int i = 0; i < dimension * dimension; ++i) {
               :
   495  0.1844 :            pixels[i].r = 255;
               :
 12618  4.7012 :            pixels[i].g = 0;
               :
  2253  0.8394 :            pixels[i].b = 0;
               :
               :        }

array

               :void UseArray()
               :{ /* UseArray() total:  35191 13.1114 */
               :
...
               :
   136  0.0507 :        for (int i = 0; i < dimension * dimension; ++i) {
               :
  9897  3.6874 :            pixels[i].r = 255;
               :
  3511  1.3081 :            pixels[i].g = 0;
               :
 21647  8.0652 :            pixels[i].b = 0;

大部分开销都在复制构造函数中。例如，

    std::vector < Pixel > pixels;//(dimension * dimension, Pixel());

    pixels.reserve(dimension * dimension);

    for (int i = 0; i < dimension * dimension; ++i) {

        pixels[i].r = 255;

        pixels[i].g = 0;

        pixels[i].b = 0;
    }

它具有与数组相同的性能。

Some profiler data (pixel is aligned to 32 bits):

g++ -msse3 -O3 -ftree-vectorize -g test.cpp -DNDEBUG && ./a.out
UseVector completed in 3.123 seconds
UseArray completed in 1.847 seconds
UseVectorPushBack completed in 9.186 seconds
The whole thing completed in 14.159 seconds

Blah

andrey@nv:~$ opannotate --source libcchem/src/a.out  | grep "Total samples for file" -A3
Overflow stats not available
 * Total samples for file : "/usr/include/c++/4.4/ext/new_allocator.h"
 *
 * 141008 52.5367
 */
--
 * Total samples for file : "/home/andrey/libcchem/src/test.cpp"
 *
 *  61556 22.9345
 */
--
 * Total samples for file : "/usr/include/c++/4.4/bits/stl_vector.h"
 *
 *  41956 15.6320
 */
--
 * Total samples for file : "/usr/include/c++/4.4/bits/stl_uninitialized.h"
 *
 *  20956  7.8078
 */
--
 * Total samples for file : "/usr/include/c++/4.4/bits/stl_construct.h"
 *
 *   2923  1.0891
 */

In allocator:

               :      // _GLIBCXX_RESOLVE_LIB_DEFECTS
               :      // 402. wrong new expression in [some_] allocator::construct
               :      void
               :      construct(pointer __p, const _Tp& __val)
141008 52.5367 :      { ::new((void *)__p) _Tp(__val); }

vector:

               :void UseVector()
               :{ /* UseVector() total:  60121 22.3999 */
...
               :
               :
 10790  4.0201 :        for (int i = 0; i < dimension * dimension; ++i) {
               :
   495  0.1844 :            pixels[i].r = 255;
               :
 12618  4.7012 :            pixels[i].g = 0;
               :
  2253  0.8394 :            pixels[i].b = 0;
               :
               :        }

array

               :void UseArray()
               :{ /* UseArray() total:  35191 13.1114 */
               :
...
               :
   136  0.0507 :        for (int i = 0; i < dimension * dimension; ++i) {
               :
  9897  3.6874 :            pixels[i].r = 255;
               :
  3511  1.3081 :            pixels[i].g = 0;
               :
 21647  8.0652 :            pixels[i].b = 0;

Most of the overhead is in the copy constructor. For example,

    std::vector < Pixel > pixels;//(dimension * dimension, Pixel());

    pixels.reserve(dimension * dimension);

    for (int i = 0; i < dimension * dimension; ++i) {

        pixels[i].r = 255;

        pixels[i].g = 0;

        pixels[i].b = 0;
    }

It has the same performance as an array.

回复收藏 0 原文

冷默言语 2024-09-25 03:49:46

我的笔记本电脑是 Lenova G770（4 GB RAM）。

操作系统是Windows 7 64位（带笔记本电脑的）

编译器是MinGW 4.6.1 。

IDE 是 Code::Blocks。

我测试了第一篇文章的源代码。

结果

O2 优化

UseArray 在 2.841 秒内完成

UseVector 在 2.548 秒内

完成 UseVectorPushBack 在 11.95 秒内

完成整个事情在 17.342 秒内完成

系统暂停

O3 优化

UseArray 在 1.452 秒内完成

UseVector 在 2.514 秒内完成

完成

UseVectorPushBack 在 12.967 秒内完成整个事情在 12.967 秒内 16.937秒

看起来在O3优化下向量的性能更差。

如果把循环改成

    pixels[i].r = i;
    pixels[i].g = i;
    pixels[i].b = i;

O2和O3下数组和向量的速度几乎是一样的。

My laptop is Lenova G770 (4 GB RAM).

The OS is Windows 7 64-bit (the one with laptop)

Compiler is MinGW 4.6.1.

The IDE is Code::Blocks.

I test the source codes of the first post.

The results

O2 optimization

UseArray completed in 2.841 seconds

UseVector completed in 2.548 seconds

UseVectorPushBack completed in 11.95 seconds

The whole thing completed in 17.342 seconds

system pause

O3 optimization

UseArray completed in 1.452 seconds

UseVector completed in 2.514 seconds

UseVectorPushBack completed in 12.967 seconds

The whole thing completed in 16.937 seconds

It looks like the performance of vector is worse under O3 optimization.

If you change the loop to

    pixels[i].r = i;
    pixels[i].g = i;
    pixels[i].b = i;

The speed of array and vector under O2 and O3 are almost the same.

回复收藏 0 原文

踏雪无痕 2024-09-25 03:49:46

一个更好的基准（我认为......），由于优化，编译器可以更改代码，因为分配的向量/数组的结果不会在任何地方使用。
结果：

$ g++ test.cpp -o test -O3 -march=native
$ ./test 
UseArray inner completed in 0.652 seconds
UseArray completed in 0.773 seconds
UseVector inner completed in 0.638 seconds
UseVector completed in 0.757 seconds
UseVectorPushBack inner completed in 6.732 seconds
UseVectorPush completed in 6.856 seconds
The whole thing completed in 8.387 seconds

编译器：

gcc version 6.2.0 20161019 (Debian 6.2.0-9)

CPU：

model name  : Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz

代码：

#include <cstdlib>
#include <vector>

#include <iostream>
#include <string>

#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>

class TestTimer
{
    public:
        TestTimer(const std::string & name) : name(name),
            start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
        {
        }

        ~TestTimer()
        {
            using namespace std;
            using namespace boost;

            posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
            posix_time::time_duration d = now - start;

            cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
                " seconds" << endl;
        }

    private:
        std::string name;
        boost::posix_time::ptime start;
};

struct Pixel
{
    Pixel()
    {
    }

    Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
    {
    }

    unsigned char r, g, b;
};

void UseVector(std::vector<std::vector<Pixel> >& results)
{
    TestTimer t("UseVector inner");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel>& pixels = results.at(i);
        pixels.resize(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

void UseVectorPushBack(std::vector<std::vector<Pixel> >& results)
{
    TestTimer t("UseVectorPushBack inner");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel>& pixels = results.at(i);
            pixels.reserve(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
            pixels.push_back(Pixel(255, 0, 0));
    }
}

void UseArray(Pixel** results)
{
    TestTimer t("UseArray inner");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);

        results[i] = pixels;

        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }

        // free(pixels);
    }
}

void UseArray()
{
    TestTimer t("UseArray");
    Pixel** array = (Pixel**)malloc(sizeof(Pixel*)* 1000);
    UseArray(array);
    for(int i=0;i<1000;++i)
        free(array[i]);
    free(array);
}

void UseVector()
{
    TestTimer t("UseVector");
    {
        std::vector<std::vector<Pixel> > vector(1000, std::vector<Pixel>());
        UseVector(vector);
    }
}

void UseVectorPushBack()
{
    TestTimer t("UseVectorPush");
    {
        std::vector<std::vector<Pixel> > vector(1000, std::vector<Pixel>());
        UseVectorPushBack(vector);
    }
}


int main()
{
    TestTimer t1("The whole thing");

    UseArray();
    UseVector();
    UseVectorPushBack();

    return 0;
}

A better benchmark (I think...), compiler due to optimizations can change code, becouse results of allocated vectors/arrays are not used anywhere.
Results:

$ g++ test.cpp -o test -O3 -march=native
$ ./test 
UseArray inner completed in 0.652 seconds
UseArray completed in 0.773 seconds
UseVector inner completed in 0.638 seconds
UseVector completed in 0.757 seconds
UseVectorPushBack inner completed in 6.732 seconds
UseVectorPush completed in 6.856 seconds
The whole thing completed in 8.387 seconds

Compiler:

gcc version 6.2.0 20161019 (Debian 6.2.0-9)

CPU:

model name  : Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz

And the code:

#include <cstdlib>
#include <vector>

#include <iostream>
#include <string>

#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>

class TestTimer
{
    public:
        TestTimer(const std::string & name) : name(name),
            start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
        {
        }

        ~TestTimer()
        {
            using namespace std;
            using namespace boost;

            posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
            posix_time::time_duration d = now - start;

            cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
                " seconds" << endl;
        }

    private:
        std::string name;
        boost::posix_time::ptime start;
};

struct Pixel
{
    Pixel()
    {
    }

    Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
    {
    }

    unsigned char r, g, b;
};

void UseVector(std::vector<std::vector<Pixel> >& results)
{
    TestTimer t("UseVector inner");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel>& pixels = results.at(i);
        pixels.resize(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

void UseVectorPushBack(std::vector<std::vector<Pixel> >& results)
{
    TestTimer t("UseVectorPushBack inner");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel>& pixels = results.at(i);
            pixels.reserve(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
            pixels.push_back(Pixel(255, 0, 0));
    }
}

void UseArray(Pixel** results)
{
    TestTimer t("UseArray inner");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);

        results[i] = pixels;

        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }

        // free(pixels);
    }
}

void UseArray()
{
    TestTimer t("UseArray");
    Pixel** array = (Pixel**)malloc(sizeof(Pixel*)* 1000);
    UseArray(array);
    for(int i=0;i<1000;++i)
        free(array[i]);
    free(array);
}

void UseVector()
{
    TestTimer t("UseVector");
    {
        std::vector<std::vector<Pixel> > vector(1000, std::vector<Pixel>());
        UseVector(vector);
    }
}

void UseVectorPushBack()
{
    TestTimer t("UseVectorPush");
    {
        std::vector<std::vector<Pixel> > vector(1000, std::vector<Pixel>());
        UseVectorPushBack(vector);
    }
}


int main()
{
    TestTimer t1("The whole thing");

    UseArray();
    UseVector();
    UseVectorPushBack();

    return 0;
}

回复收藏 0 原文

云之铃。 2024-09-25 03:49:46

我做了一些我长期以来想做的广泛测试。不妨分享一下这个。

这是我的双启动机器 i7-3770、16GB RAM、x86_64，运行 Windows 8.1 和 Ubuntu 16.04。更多信息和结论，请参见下面的评论。测试了 MSVS 2017 和 g++（在 Windows 和 Linux 上）。

测试程序

#include <iostream>
#include <chrono>
//#include <algorithm>
#include <array>
#include <locale>
#include <vector>
#include <queue>
#include <deque>

// Note: total size of array must not exceed 0x7fffffff B = 2,147,483,647B
//  which means that largest int array size is 536,870,911
// Also image size cannot be larger than 80,000,000B
constexpr int long g_size = 100000;
int g_A[g_size];


int main()
{
    std::locale loc("");
    std::cout.imbue(loc);
    constexpr int long size = 100000;  // largest array stack size

    // stack allocated c array
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    int A[size];
    for (int i = 0; i < size; i++)
        A[i] = i;

    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "c-style stack array duration=" << duration / 1000.0 << "ms\n";
    std::cout << "c-style stack array size=" << sizeof(A) << "B\n\n";

    // global stack c array
    start = std::chrono::steady_clock::now();
    for (int i = 0; i < g_size; i++)
        g_A[i] = i;

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "global c-style stack array duration=" << duration / 1000.0 << "ms\n";
    std::cout << "global c-style stack array size=" << sizeof(g_A) << "B\n\n";

    // raw c array heap array
    start = std::chrono::steady_clock::now();
    int* AA = new int[size];    // bad_alloc() if it goes higher than 1,000,000,000
    for (int i = 0; i < size; i++)
        AA[i] = i;

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "c-style heap array duration=" << duration / 1000.0 << "ms\n";
    std::cout << "c-style heap array size=" << sizeof(AA) << "B\n\n";
    delete[] AA;

    // std::array<>
    start = std::chrono::steady_clock::now();
    std::array<int, size> AAA;
    for (int i = 0; i < size; i++)
        AAA[i] = i;
    //std::sort(AAA.begin(), AAA.end());

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "std::array duration=" << duration / 1000.0 << "ms\n";
    std::cout << "std::array size=" << sizeof(AAA) << "B\n\n";

    // std::vector<>
    start = std::chrono::steady_clock::now();
    std::vector<int> v;
    for (int i = 0; i < size; i++)
        v.push_back(i);
    //std::sort(v.begin(), v.end());

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "std::vector duration=" << duration / 1000.0 << "ms\n";
    std::cout << "std::vector size=" << v.size() * sizeof(v.back()) << "B\n\n";

    // std::deque<>
    start = std::chrono::steady_clock::now();
    std::deque<int> dq;
    for (int i = 0; i < size; i++)
        dq.push_back(i);
    //std::sort(dq.begin(), dq.end());

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "std::deque duration=" << duration / 1000.0 << "ms\n";
    std::cout << "std::deque size=" << dq.size() * sizeof(dq.back()) << "B\n\n";

    // std::queue<>
    start = std::chrono::steady_clock::now();
    std::queue<int> q;
    for (int i = 0; i < size; i++)
        q.push(i);

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "std::queue duration=" << duration / 1000.0 << "ms\n";
    std::cout << "std::queue size=" << q.size() * sizeof(q.front()) << "B\n\n";
}

结果

//////////////////////////////////////////////////////////////////////////////////////////
// with MSVS 2017:
// >> cl /std:c++14 /Wall -O2 array_bench.cpp
//
// c-style stack array duration=0.15ms
// c-style stack array size=400,000B
//
// global c-style stack array duration=0.130ms
// global c-style stack array size=400,000B
//
// c-style heap array duration=0.90ms
// c-style heap array size=4B
//
// std::array duration=0.20ms
// std::array size=400,000B
//
// std::vector duration=0.544ms
// std::vector size=400,000B
//
// std::deque duration=1.375ms
// std::deque size=400,000B
//
// std::queue duration=1.491ms
// std::queue size=400,000B
//
//////////////////////////////////////////////////////////////////////////////////////////
//
// with g++ version:
//      - (tdm64-1) 5.1.0 on Windows
//      - (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609 on Ubuntu 16.04
// >> g++ -std=c++14 -Wall -march=native -O2 array_bench.cpp -o array_bench
//
// c-style stack array duration=0ms
// c-style stack array size=400,000B
//
// global c-style stack array duration=0.124ms
// global c-style stack array size=400,000B
//
// c-style heap array duration=0.648ms
// c-style heap array size=8B
//
// std::array duration=1ms
// std::array size=400,000B
//
// std::vector duration=0.402ms
// std::vector size=400,000B
//
// std::deque duration=0.234ms
// std::deque size=400,000B
//
// std::queue duration=0.304ms
// std::queue size=400,000
//
//////////////////////////////////////////////////////////////////////////////////////////

注释

由平均 10 次运行得出。
我最初也使用 std::sort() 进行了测试（您可以看到它已被注释掉），但后来删除了它们，因为没有显着的相对差异。

我的结论和评论

注意到全局 c 样式数组花费的时间几乎与堆 c 样式数组一样多
在所有测试中，我注意到 std::array 连续之间的时间变化具有显着的稳定性运行，而其他尤其是 std:: 数据结构相比变化很大
O3 优化没有显示任何值得注意的时间差异
删除 Windows cl（无 -O2）和 g++ 上的优化（Win/Linux 无 -O2，无 -march=native ）显着增加时间。特别是对于 std::data 结构。总体而言，MSVS 上的时间比 g++ 更高，但 std::array 和 c 样式数组在 Windows 上运行得更快，无需优化
g++ 生成的代码比微软的编译器更快（显然，即使在 Windows 上它也运行得更快）。

结论

当然，这是优化构建的代码。既然问题是关于 std::vector 的，那么是的，它是！很多！比普通数组慢（优化/未优化）。但是当您进行基准测试时，您自然希望生成优化的代码。

对我来说，这场秀的明星是 std::array。

I did some extensive tests that I wanted to for a while now. Might as well share this.

This is my dual boot machine i7-3770, 16GB Ram, x86_64, on Windows 8.1 and on Ubuntu 16.04. More information and conclusions, remarks below. Tested both MSVS 2017 and g++ (both on Windows and on Linux).

Test Program

#include <iostream>
#include <chrono>
//#include <algorithm>
#include <array>
#include <locale>
#include <vector>
#include <queue>
#include <deque>

// Note: total size of array must not exceed 0x7fffffff B = 2,147,483,647B
//  which means that largest int array size is 536,870,911
// Also image size cannot be larger than 80,000,000B
constexpr int long g_size = 100000;
int g_A[g_size];


int main()
{
    std::locale loc("");
    std::cout.imbue(loc);
    constexpr int long size = 100000;  // largest array stack size

    // stack allocated c array
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    int A[size];
    for (int i = 0; i < size; i++)
        A[i] = i;

    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "c-style stack array duration=" << duration / 1000.0 << "ms\n";
    std::cout << "c-style stack array size=" << sizeof(A) << "B\n\n";

    // global stack c array
    start = std::chrono::steady_clock::now();
    for (int i = 0; i < g_size; i++)
        g_A[i] = i;

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "global c-style stack array duration=" << duration / 1000.0 << "ms\n";
    std::cout << "global c-style stack array size=" << sizeof(g_A) << "B\n\n";

    // raw c array heap array
    start = std::chrono::steady_clock::now();
    int* AA = new int[size];    // bad_alloc() if it goes higher than 1,000,000,000
    for (int i = 0; i < size; i++)
        AA[i] = i;

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "c-style heap array duration=" << duration / 1000.0 << "ms\n";
    std::cout << "c-style heap array size=" << sizeof(AA) << "B\n\n";
    delete[] AA;

    // std::array<>
    start = std::chrono::steady_clock::now();
    std::array<int, size> AAA;
    for (int i = 0; i < size; i++)
        AAA[i] = i;
    //std::sort(AAA.begin(), AAA.end());

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "std::array duration=" << duration / 1000.0 << "ms\n";
    std::cout << "std::array size=" << sizeof(AAA) << "B\n\n";

    // std::vector<>
    start = std::chrono::steady_clock::now();
    std::vector<int> v;
    for (int i = 0; i < size; i++)
        v.push_back(i);
    //std::sort(v.begin(), v.end());

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "std::vector duration=" << duration / 1000.0 << "ms\n";
    std::cout << "std::vector size=" << v.size() * sizeof(v.back()) << "B\n\n";

    // std::deque<>
    start = std::chrono::steady_clock::now();
    std::deque<int> dq;
    for (int i = 0; i < size; i++)
        dq.push_back(i);
    //std::sort(dq.begin(), dq.end());

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "std::deque duration=" << duration / 1000.0 << "ms\n";
    std::cout << "std::deque size=" << dq.size() * sizeof(dq.back()) << "B\n\n";

    // std::queue<>
    start = std::chrono::steady_clock::now();
    std::queue<int> q;
    for (int i = 0; i < size; i++)
        q.push(i);

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "std::queue duration=" << duration / 1000.0 << "ms\n";
    std::cout << "std::queue size=" << q.size() * sizeof(q.front()) << "B\n\n";
}

Results

//////////////////////////////////////////////////////////////////////////////////////////
// with MSVS 2017:
// >> cl /std:c++14 /Wall -O2 array_bench.cpp
//
// c-style stack array duration=0.15ms
// c-style stack array size=400,000B
//
// global c-style stack array duration=0.130ms
// global c-style stack array size=400,000B
//
// c-style heap array duration=0.90ms
// c-style heap array size=4B
//
// std::array duration=0.20ms
// std::array size=400,000B
//
// std::vector duration=0.544ms
// std::vector size=400,000B
//
// std::deque duration=1.375ms
// std::deque size=400,000B
//
// std::queue duration=1.491ms
// std::queue size=400,000B
//
//////////////////////////////////////////////////////////////////////////////////////////
//
// with g++ version:
//      - (tdm64-1) 5.1.0 on Windows
//      - (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609 on Ubuntu 16.04
// >> g++ -std=c++14 -Wall -march=native -O2 array_bench.cpp -o array_bench
//
// c-style stack array duration=0ms
// c-style stack array size=400,000B
//
// global c-style stack array duration=0.124ms
// global c-style stack array size=400,000B
//
// c-style heap array duration=0.648ms
// c-style heap array size=8B
//
// std::array duration=1ms
// std::array size=400,000B
//
// std::vector duration=0.402ms
// std::vector size=400,000B
//
// std::deque duration=0.234ms
// std::deque size=400,000B
//
// std::queue duration=0.304ms
// std::queue size=400,000
//
//////////////////////////////////////////////////////////////////////////////////////////

Notes

Assembled by an average of 10 runs.
I initially performed tests with std::sort() too (you can see it commented out) but removed them later because there were no significant relative differences.

My Conclusions and Remarks

notice how global c-style array takes almost as much time as the heap c-style array
Out of all tests I noticed a remarkable stability in std::array's time variations between consecutive runs, while others especially std:: data structs varied wildly in comparison
O3 optimization didn't show any noteworthy time differences
Removing optimization on Windows cl (no -O2) and on g++ (Win/Linux no -O2, no -march=native) increases times SIGNIFICANTLY. Particularly for std::data structs. Overall higher times on MSVS than g++, but std::array and c-style arrays faster on Windows without optimization
g++ produces faster code than microsoft's compiler (apparently it runs faster even on Windows).

Verdict

Of course this is code for an optimized build. And since the question was about std::vector then yes it is !much! slower than plain arrays (optimized/unoptimized). But when you're doing a benchmark, you naturally want to produce optimized code.

The star of the show for me though has been std::array.

回复收藏 0 原文

我一向站在原地 2024-09-25 03:49:46

通过正确的选项，向量和数组可以生成相同的asm。在这些情况下，它们当然具有相同的速度，因为无论哪种方式您都会获得相同的可执行文件。

回复收藏 0 原文

恍梦境° 2024-09-25 03:49:46

顺便说一句，在使用向量的类中，您的观察速度也会减慢，这也发生在像 int 这样的标准类型上。这是一个多线程代码：

#include <iostream>
#include <cstdio>
#include <map>
#include <string>
#include <typeinfo>
#include <vector>
#include <pthread.h>
#include <sstream>
#include <fstream>
using namespace std;

//pthread_mutex_t map_mutex=PTHREAD_MUTEX_INITIALIZER;

long long num=500000000;
int procs=1;

struct iterate
{
    int id;
    int num;
    void * member;
    iterate(int a, int b, void *c) : id(a), num(b), member(c) {}
};

//fill out viterate and piterate
void * viterate(void * input)
{
    printf("am in viterate\n");
    iterate * info=static_cast<iterate *> (input);
    // reproduce member type
    vector<int> test= *static_cast<vector<int>*> (info->member);
    for (int i=info->id; i<test.size(); i+=info->num)
    {
        //printf("am in viterate loop\n");
        test[i];
    }
    pthread_exit(NULL);
}

void * piterate(void * input)
{
    printf("am in piterate\n");
    iterate * info=static_cast<iterate *> (input);;
    int * test=static_cast<int *> (info->member);
    for (int i=info->id; i<num; i+=info->num) {
        //printf("am in piterate loop\n");
        test[i];
    }
    pthread_exit(NULL);
}

int main()
{
    cout<<"producing vector of size "<<num<<endl;
    vector<int> vtest(num);
    cout<<"produced  a vector of size "<<vtest.size()<<endl;
    pthread_t thread[procs];

    iterate** it=new iterate*[procs];
    int ans;
    void *status;

    cout<<"begining to thread through the vector\n";
    for (int i=0; i<procs; i++) {
        it[i]=new iterate(i, procs, (void *) &vtest);
    //  ans=pthread_create(&thread[i],NULL,viterate, (void *) it[i]);
    }
    for (int i=0; i<procs; i++) {
        pthread_join(thread[i], &status);
    }
    cout<<"end of threading through the vector";
    //reuse the iterate structures

    cout<<"producing a pointer with size "<<num<<endl;
    int * pint=new int[num];
    cout<<"produced a pointer with size "<<num<<endl;

    cout<<"begining to thread through the pointer\n";
    for (int i=0; i<procs; i++) {
        it[i]->member=&pint;
        ans=pthread_create(&thread[i], NULL, piterate, (void*) it[i]);
    }
    for (int i=0; i<procs; i++) {
        pthread_join(thread[i], &status);
    }
    cout<<"end of threading through the pointer\n";

    //delete structure array for iterate
    for (int i=0; i<procs; i++) {
        delete it[i];
    }
    delete [] it;

    //delete pointer
    delete [] pint;

    cout<<"end of the program"<<endl;
    return 0;
}

代码的行为表明向量的实例化是代码最长的部分。一旦你突破了这个瓶颈。其余代码运行得非常快。无论运行多少个线程，都是如此。

顺便说一下，忽略绝对疯狂的包含数量。我一直在使用这段代码来测试项目的内容，因此包含的数量不断增长。

By the way the slow down your seeing in classes using vector also occurs with standard types like int. Heres a multithreaded code:

#include <iostream>
#include <cstdio>
#include <map>
#include <string>
#include <typeinfo>
#include <vector>
#include <pthread.h>
#include <sstream>
#include <fstream>
using namespace std;

//pthread_mutex_t map_mutex=PTHREAD_MUTEX_INITIALIZER;

long long num=500000000;
int procs=1;

struct iterate
{
    int id;
    int num;
    void * member;
    iterate(int a, int b, void *c) : id(a), num(b), member(c) {}
};

//fill out viterate and piterate
void * viterate(void * input)
{
    printf("am in viterate\n");
    iterate * info=static_cast<iterate *> (input);
    // reproduce member type
    vector<int> test= *static_cast<vector<int>*> (info->member);
    for (int i=info->id; i<test.size(); i+=info->num)
    {
        //printf("am in viterate loop\n");
        test[i];
    }
    pthread_exit(NULL);
}

void * piterate(void * input)
{
    printf("am in piterate\n");
    iterate * info=static_cast<iterate *> (input);;
    int * test=static_cast<int *> (info->member);
    for (int i=info->id; i<num; i+=info->num) {
        //printf("am in piterate loop\n");
        test[i];
    }
    pthread_exit(NULL);
}

int main()
{
    cout<<"producing vector of size "<<num<<endl;
    vector<int> vtest(num);
    cout<<"produced  a vector of size "<<vtest.size()<<endl;
    pthread_t thread[procs];

    iterate** it=new iterate*[procs];
    int ans;
    void *status;

    cout<<"begining to thread through the vector\n";
    for (int i=0; i<procs; i++) {
        it[i]=new iterate(i, procs, (void *) &vtest);
    //  ans=pthread_create(&thread[i],NULL,viterate, (void *) it[i]);
    }
    for (int i=0; i<procs; i++) {
        pthread_join(thread[i], &status);
    }
    cout<<"end of threading through the vector";
    //reuse the iterate structures

    cout<<"producing a pointer with size "<<num<<endl;
    int * pint=new int[num];
    cout<<"produced a pointer with size "<<num<<endl;

    cout<<"begining to thread through the pointer\n";
    for (int i=0; i<procs; i++) {
        it[i]->member=&pint;
        ans=pthread_create(&thread[i], NULL, piterate, (void*) it[i]);
    }
    for (int i=0; i<procs; i++) {
        pthread_join(thread[i], &status);
    }
    cout<<"end of threading through the pointer\n";

    //delete structure array for iterate
    for (int i=0; i<procs; i++) {
        delete it[i];
    }
    delete [] it;

    //delete pointer
    delete [] pint;

    cout<<"end of the program"<<endl;
    return 0;
}

The behavior from the code shows the instantiation of vector is the longest part of the code. Once you get through that bottle neck. The rest of the code runs extremely fast. This is true no matter how many threads you are running on.

By the way ignore the absolutely insane number of includes. I have been using this code to test things for a project so the number of includes keep growing.

回复收藏 0 原文

仲春光 2024-09-25 03:49:46

我只想提一下，向量（和 smart_ptr）只是添加在原始数组（和原始指针）之上的薄层。
实际上，连续内存中向量的访问时间比数组更快。
以下代码显示了初始化和访问向量和数组的结果。

#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>
#include <vector>
#define SIZE 20000
int main() {
    srand (time(NULL));
    vector<vector<int>> vector2d;
    vector2d.reserve(SIZE);
    int index(0);
    boost::posix_time::ptime start_total = boost::posix_time::microsec_clock::local_time();
    //  timer start - build + access
    for (int i = 0; i < SIZE; i++) {
        vector2d.push_back(vector<int>(SIZE));
    }
    boost::posix_time::ptime start_access = boost::posix_time::microsec_clock::local_time();
    //  timer start - access
    for (int i = 0; i < SIZE; i++) {
        index = rand()%SIZE;
        for (int j = 0; j < SIZE; j++) {

            vector2d[index][index]++;
        }
    }
    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration msdiff = end - start_total;
    cout << "Vector total time: " << msdiff.total_milliseconds() << "milliseconds.\n";
    msdiff = end - start_acess;
    cout << "Vector access time: " << msdiff.total_milliseconds() << "milliseconds.\n"; 


    int index(0);
    int** raw2d = nullptr;
    raw2d = new int*[SIZE];
    start_total = boost::posix_time::microsec_clock::local_time();
    //  timer start - build + access
    for (int i = 0; i < SIZE; i++) {
        raw2d[i] = new int[SIZE];
    }
    start_access = boost::posix_time::microsec_clock::local_time();
    //  timer start - access
    for (int i = 0; i < SIZE; i++) {
        index = rand()%SIZE;
        for (int j = 0; j < SIZE; j++) {

            raw2d[index][index]++;
        }
    }
    end = boost::posix_time::microsec_clock::local_time();
    msdiff = end - start_total;
    cout << "Array total time: " << msdiff.total_milliseconds() << "milliseconds.\n";
    msdiff = end - start_acess;
    cout << "Array access time: " << msdiff.total_milliseconds() << "milliseconds.\n"; 
    for (int i = 0; i < SIZE; i++) {
        delete [] raw2d[i];
    }
    return 0;
}

输出是：

    Vector total time: 925milliseconds.
    Vector access time: 4milliseconds.
    Array total time: 30milliseconds.
    Array access time: 21milliseconds.

所以如果你使用得当，速度几乎是一样的。
（正如其他人提到的使用reserve()或resize()）。

I just want to mention that vector (and smart_ptr) is just a thin layer add on top of raw arrays (and raw pointers).
And actually the access time of an vector in continuous memory is faster than array.
The following code shows the result of initialize and access vector and array.

#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>
#include <vector>
#define SIZE 20000
int main() {
    srand (time(NULL));
    vector<vector<int>> vector2d;
    vector2d.reserve(SIZE);
    int index(0);
    boost::posix_time::ptime start_total = boost::posix_time::microsec_clock::local_time();
    //  timer start - build + access
    for (int i = 0; i < SIZE; i++) {
        vector2d.push_back(vector<int>(SIZE));
    }
    boost::posix_time::ptime start_access = boost::posix_time::microsec_clock::local_time();
    //  timer start - access
    for (int i = 0; i < SIZE; i++) {
        index = rand()%SIZE;
        for (int j = 0; j < SIZE; j++) {

            vector2d[index][index]++;
        }
    }
    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration msdiff = end - start_total;
    cout << "Vector total time: " << msdiff.total_milliseconds() << "milliseconds.\n";
    msdiff = end - start_acess;
    cout << "Vector access time: " << msdiff.total_milliseconds() << "milliseconds.\n"; 


    int index(0);
    int** raw2d = nullptr;
    raw2d = new int*[SIZE];
    start_total = boost::posix_time::microsec_clock::local_time();
    //  timer start - build + access
    for (int i = 0; i < SIZE; i++) {
        raw2d[i] = new int[SIZE];
    }
    start_access = boost::posix_time::microsec_clock::local_time();
    //  timer start - access
    for (int i = 0; i < SIZE; i++) {
        index = rand()%SIZE;
        for (int j = 0; j < SIZE; j++) {

            raw2d[index][index]++;
        }
    }
    end = boost::posix_time::microsec_clock::local_time();
    msdiff = end - start_total;
    cout << "Array total time: " << msdiff.total_milliseconds() << "milliseconds.\n";
    msdiff = end - start_acess;
    cout << "Array access time: " << msdiff.total_milliseconds() << "milliseconds.\n"; 
    for (int i = 0; i < SIZE; i++) {
        delete [] raw2d[i];
    }
    return 0;
}

The output is:

    Vector total time: 925milliseconds.
    Vector access time: 4milliseconds.
    Array total time: 30milliseconds.
    Array access time: 21milliseconds.

So the speed will be almost the same if you use it properly.
(as others mentioned using reserve() or resize()).

回复收藏 0 原文

残月升风 2024-09-25 03:49:46

嗯，因为 vector::resize() 比普通内存分配（通过 malloc）执行更多处理。

尝试在复制构造函数中放置一个断点（定义它以便可以断点！），这会增加额外的处理时间。

回复收藏 0 原文

恍梦境° 2024-09-25 03:49:46

我不得不说我不是 C++ 专家。但要添加一些实验结果：

编译：
gcc-6.2.0/bin/g++ -O3 -std=c++14 vector.cpp

machine:

Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz

OS:

2.6.32-642.13.1.el6.x86_64

Output:

UseArray completed in 0.167821 seconds
UseVector completed in 0.134402 seconds
UseConstructor completed in 0.134806 seconds
UseFillConstructor completed in 1.00279 seconds
UseVectorPushBack completed in 6.6887 seconds
The whole thing completed in 8.12888 seconds

这里唯一让我感到奇怪的是“UseFillConstructor”性能与“UseConstructor”相比。

代码：

void UseConstructor()
{
    TestTimer t("UseConstructor");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels(dimension*dimension);
        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}


void UseFillConstructor()
{
    TestTimer t("UseFillConstructor");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels(dimension*dimension, Pixel(255,0,0));
    }
}

因此，提供的额外“值”会大大降低性能，我认为这是由于多次调用复制构造函数所致。但是...

编译：

gcc-6.2.0/bin/g++ -std=c++14 -O vector.cpp

输出：

UseArray completed in 1.02464 seconds
UseVector completed in 1.31056 seconds
UseConstructor completed in 1.47413 seconds
UseFillConstructor completed in 1.01555 seconds
UseVectorPushBack completed in 6.9597 seconds
The whole thing completed in 11.7851 seconds

所以在这种情况下，gcc 优化非常重要，但当提供默认值时，它对您没有多大帮助。这，其实是抵扣我的学费的。希望它可以帮助新程序员选择哪种向量初始化格式。

I Have to say I'm not an expert in C++. But to add some experiments results:

compile:
gcc-6.2.0/bin/g++ -O3 -std=c++14 vector.cpp

machine:

Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz

OS:

2.6.32-642.13.1.el6.x86_64

Output:

UseArray completed in 0.167821 seconds
UseVector completed in 0.134402 seconds
UseConstructor completed in 0.134806 seconds
UseFillConstructor completed in 1.00279 seconds
UseVectorPushBack completed in 6.6887 seconds
The whole thing completed in 8.12888 seconds

Here the only thing I feel strange is that "UseFillConstructor" performance compared with "UseConstructor".

The code:

void UseConstructor()
{
    TestTimer t("UseConstructor");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels(dimension*dimension);
        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}


void UseFillConstructor()
{
    TestTimer t("UseFillConstructor");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels(dimension*dimension, Pixel(255,0,0));
    }
}

So the additional "value" provided slows down performance quite a lot, which I think is due to multiple call to copy constructor. But...

Compile:

gcc-6.2.0/bin/g++ -std=c++14 -O vector.cpp

Output:

UseArray completed in 1.02464 seconds
UseVector completed in 1.31056 seconds
UseConstructor completed in 1.47413 seconds
UseFillConstructor completed in 1.01555 seconds
UseVectorPushBack completed in 6.9597 seconds
The whole thing completed in 11.7851 seconds

So in this case, gcc optimization is very important but it can't help you much when a value is provided as default. This, is against my tuition actually. Hopefully it helps new programmer when choose which vector initialization format.

回复收藏 0 原文

隐诗 2024-09-25 03:49:46

它似乎取决于编译器标志。这是一个基准代码：

#include <chrono>
#include <cmath>
#include <ctime>
#include <iostream>
#include <vector>


int main(){

    int size = 1000000; // reduce this number in case your program crashes
    int L = 10;

    std::cout << "size=" << size << " L=" << L << std::endl;
    {
        srand( time(0) );
        double * data = new double[size];
        double result = 0.;
        std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
        for( int l = 0; l < L; l++ ) {
            for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
            for( int i = 0; i < size; i++ ) result += data[i] * data[i];
        }
        std::chrono::steady_clock::time_point end   = std::chrono::steady_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::cout << "Calculation result is " << sqrt(result) << "\n";
        std::cout << "Duration of C style heap array:    " << duration << "ms\n";
        delete data;
    }

    {
        srand( 1 + time(0) );
        double data[size]; // technically, non-compliant with C++ standard.
        double result = 0.;
        std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
        for( int l = 0; l < L; l++ ) {
            for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
            for( int i = 0; i < size; i++ ) result += data[i] * data[i];
        }
        std::chrono::steady_clock::time_point end   = std::chrono::steady_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::cout << "Calculation result is " << sqrt(result) << "\n";
        std::cout << "Duration of C99 style stack array: " << duration << "ms\n";
    }

    {
        srand( 2 + time(0) );
        std::vector<double> data( size );
        double result = 0.;
        std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
        for( int l = 0; l < L; l++ ) {
            for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
            for( int i = 0; i < size; i++ ) result += data[i] * data[i];
        }
        std::chrono::steady_clock::time_point end   = std::chrono::steady_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::cout << "Calculation result is " << sqrt(result) << "\n";
        std::cout << "Duration of std::vector array:     " << duration << "ms\n";
    }

    return 0;
}

不同的优化标志给出不同的答案：

$ g++ -O0 benchmark.cpp 
$ ./a.out 
size=1000000 L=10
Calculation result is 181182
Duration of C style heap array:    118441ms
Calculation result is 181240
Duration of C99 style stack array: 104920ms
Calculation result is 181210
Duration of std::vector array:     124477ms
$g++ -O3 benchmark.cpp
$ ./a.out 
size=1000000 L=10
Calculation result is 181213
Duration of C style heap array:    107803ms
Calculation result is 181198
Duration of C99 style stack array: 87247ms
Calculation result is 181204
Duration of std::vector array:     89083ms
$ g++ -Ofast benchmark.cpp 
$ ./a.out 
size=1000000 L=10
Calculation result is 181164
Duration of C style heap array:    93530ms
Calculation result is 181179
Duration of C99 style stack array: 80620ms
Calculation result is 181191
Duration of std::vector array:     78830ms

您的确切结果会有所不同，但这在我的机器上非常典型。

It seems to depend on the compiler flags. Here is a benchmark code:

#include <chrono>
#include <cmath>
#include <ctime>
#include <iostream>
#include <vector>


int main(){

    int size = 1000000; // reduce this number in case your program crashes
    int L = 10;

    std::cout << "size=" << size << " L=" << L << std::endl;
    {
        srand( time(0) );
        double * data = new double[size];
        double result = 0.;
        std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
        for( int l = 0; l < L; l++ ) {
            for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
            for( int i = 0; i < size; i++ ) result += data[i] * data[i];
        }
        std::chrono::steady_clock::time_point end   = std::chrono::steady_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::cout << "Calculation result is " << sqrt(result) << "\n";
        std::cout << "Duration of C style heap array:    " << duration << "ms\n";
        delete data;
    }

    {
        srand( 1 + time(0) );
        double data[size]; // technically, non-compliant with C++ standard.
        double result = 0.;
        std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
        for( int l = 0; l < L; l++ ) {
            for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
            for( int i = 0; i < size; i++ ) result += data[i] * data[i];
        }
        std::chrono::steady_clock::time_point end   = std::chrono::steady_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::cout << "Calculation result is " << sqrt(result) << "\n";
        std::cout << "Duration of C99 style stack array: " << duration << "ms\n";
    }

    {
        srand( 2 + time(0) );
        std::vector<double> data( size );
        double result = 0.;
        std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
        for( int l = 0; l < L; l++ ) {
            for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
            for( int i = 0; i < size; i++ ) result += data[i] * data[i];
        }
        std::chrono::steady_clock::time_point end   = std::chrono::steady_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::cout << "Calculation result is " << sqrt(result) << "\n";
        std::cout << "Duration of std::vector array:     " << duration << "ms\n";
    }

    return 0;
}

Different optimization flags give different answers:

$ g++ -O0 benchmark.cpp 
$ ./a.out 
size=1000000 L=10
Calculation result is 181182
Duration of C style heap array:    118441ms
Calculation result is 181240
Duration of C99 style stack array: 104920ms
Calculation result is 181210
Duration of std::vector array:     124477ms
$g++ -O3 benchmark.cpp
$ ./a.out 
size=1000000 L=10
Calculation result is 181213
Duration of C style heap array:    107803ms
Calculation result is 181198
Duration of C99 style stack array: 87247ms
Calculation result is 181204
Duration of std::vector array:     89083ms
$ g++ -Ofast benchmark.cpp 
$ ./a.out 
size=1000000 L=10
Calculation result is 181164
Duration of C style heap array:    93530ms
Calculation result is 181179
Duration of C99 style stack array: 80620ms
Calculation result is 181191
Duration of std::vector array:     78830ms

Your exact results will vary but this is quite typical on my machine.

回复收藏 0 原文