Performance: vector of classes or a class containing vectors

Published 2024-07-17 23:49:38

I have a class containing a number of double values. This is stored in a vector where the indices for the classes are important (they are referenced from elsewhere). The class looks something like this:

Vector of classes

class A
{
public:
  double count;
  double val;
  double sumA;
  double sumB;

  vector<double> sumVectorC;
  vector<double> sumVectorD;
};

vector<A> classes(10000);

The code that needs to run as fast as possible is something like this:

vector<double> result(classes.size());
for(int i = 0; i < classes.size(); i++)
{
  result[i] += classes[i].sumA;
  vector<double>::iterator it = find(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end(), testval);
  if(it != classes[i].sumVectorC.end())
    result[i] += *it;
}

The alternative is instead of one giant loop, split the computation into two separate loops such as:

for(int i = 0; i < classes.size(); i++)
{
  result[i] += classes[i].sumA;
}
for(int i = 0; i < classes.size(); i++)
{
  vector<double>::iterator it = find(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end(), testval);
  if(it != classes[i].sumVectorC.end())
    result[i] += *it;
}

or to store each member of the class in a vector like so:

Class of vectors

vector<double> classCounts;
vector<double> classVal;
...
vector<vector<double> > classSumVectorC;
...

and then operate as:

for(int i = 0; i < classes.size(); i++)
{
  result[i] += classCounts[i];
  ...
}
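For concreteness, the structure-of-arrays variant above could be sketched as follows. This is a minimal sketch, assuming the question's layout; the function name `accumulate_soa` and the parameter names are illustrative, not from the original code:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Structure-of-arrays layout: each member of A becomes its own parallel
// vector. classSumA and classSumVectorC mirror A::sumA and A::sumVectorC.
std::vector<double> accumulate_soa(const std::vector<double>& classSumA,
                                   const std::vector<std::vector<double> >& classSumVectorC,
                                   double testval)
{
    std::vector<double> result(classSumA.size(), 0.0);
    for (std::size_t i = 0; i < classSumA.size(); ++i) {
        // Contiguous reads: one cache line of classSumA serves several i's.
        result[i] += classSumA[i];
        const std::vector<double>& c = classSumVectorC[i];
        std::vector<double>::const_iterator it =
            std::find(c.begin(), c.end(), testval);
        if (it != c.end())
            result[i] += *it;
    }
    return result;
}
```

The benefit over the vector-of-classes layout is that the `sumA` accumulation touches one densely packed array instead of striding through whole `A` objects.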

Which way would usually be faster (across x86/x64 platforms and compilers)? Are look-ahead and cache lines the most important things to think about here?

Update

The reason I'm doing a linear search (i.e. find) here and not a hash map or binary search is because the sumVectors are very short, around 4 or 5 elements. Profiling showed a hash map was slower and a binary search was slightly slower.


聚集的泪 2024-07-24 23:49:38

As the implementation of both variants seems easy enough I would build both versions and profile them to find the fastest one.

Empirical data usually beats speculation.

喵星人汪星人 2024-07-24 23:49:38

As a side issue: Currently, the find() in your innermost loop does a linear scan through all elements of classes[i].sumVectorC until it finds a matching value. If that vector contains many values, and you have no reason to believe that testVal appears near the start of the vector, then this will be slow -- consider using a container type with faster lookup instead (e.g. std::map or one of the nonstandard but commonly implemented hash_map types).

As a general guideline: consider algorithmic improvements before low-level implementation optimisation.
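A sketch of the suggested switch to a sorted associative container, assuming the values in `sumVectorC` can live in a `std::set` instead (the member name `sumSetC` and the reduced struct shape are assumptions for illustration, not the original code):

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Replacing the linear scan over sumVectorC with a sorted associative
// container turns each O(n) lookup into O(log n).
struct A {
    double sumA;
    std::set<double> sumSetC;  // stands in for the unsorted vector<double> sumVectorC
};

void accumulate(const std::vector<A>& classes, double testval,
                std::vector<double>& result)
{
    for (std::size_t i = 0; i < classes.size(); ++i) {
        result[i] += classes[i].sumA;
        std::set<double>::const_iterator it = classes[i].sumSetC.find(testval);
        if (it != classes[i].sumSetC.end())
            result[i] += *it;  // same effect as the linear find() in the question
    }
}
```

Note the asker's update: with only 4 or 5 elements per container, profiling showed the tree-based lookups losing to the linear scan, so this is worth measuring rather than assuming.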

千鲤 2024-07-24 23:49:38

As lothar says, you really should test it out. But to answer your last question, yes, cache misses will be a major concern here.

Also, it seems that your first implementation would run into load-hit-store stalls as coded, but I'm not sure how much of a problem that is on x86 (it's a big problem on Xbox 360 and PS3).

烧了回忆取暖 2024-07-24 23:49:38

It looks like optimizing the find() would be a big win (profile to know for sure). Depending on the various sizes, in addition to replacing the vector with another container, you could try sorting sumVectorC and using a binary search in the form of lower_bound. This will turn your linear search O(n) into O(log n).
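A sketch of the sorted-vector approach with `std::lower_bound`, assuming `sumVectorC` can be kept sorted (the helper name `sorted_find` is illustrative):

```cpp
#include <algorithm>
#include <vector>

// Binary search over a sorted vector: O(log n) instead of O(n), while
// keeping the compact, contiguous storage of a vector.
bool sorted_find(const std::vector<double>& sortedVec, double testval,
                 double& out)
{
    std::vector<double>::const_iterator it =
        std::lower_bound(sortedVec.begin(), sortedVec.end(), testval);
    if (it != sortedVec.end() && *it == testval) {
        out = *it;  // lower_bound finds the first element not less than testval
        return true;
    }
    return false;
}
```

The trade-off is keeping the vector sorted on every insertion; per the asker's update, at 4 or 5 elements the binary search was actually slightly slower than the linear scan.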

绻影浮沉 2024-07-24 23:49:38

If you can guarantee that std::numeric_limits<double>::infinity() is not a possible value, ensure that the arrays are sorted with a dummy infinite entry at the end, and then hand-code the find so that the loop condition is a single test:

 array[i] < test_val

and then an equality test.

Then you know that, in the not-found case, the average number of values looked at is (size()+1)/2. Of course, if the search array changes very frequently, then keeping it sorted becomes an issue.
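The sentinel trick described above could be sketched like this, assuming +infinity never occurs as real data and the array is kept sorted with an infinity appended (the function name `sentinel_find` is an assumption):

```cpp
#include <limits>
#include <vector>

// Sentinel search: because the last element is +infinity, the loop needs
// only the single comparison array[i] < test_val and no end-of-array check,
// followed by one equality test once the loop stops.
bool sentinel_find(const std::vector<double>& sortedWithSentinel,
                   double test_val)
{
    std::size_t i = 0;
    while (sortedWithSentinel[i] < test_val)  // single loop condition
        ++i;
    return sortedWithSentinel[i] == test_val;  // one equality test at the end
}
```

The sentinel guarantees termination, so the usual `i < size()` bound check disappears from the hot loop.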

Of course, you don't tell us much about sumVectorC, or the rest of A for that matter, so it is hard to be sure and give really good advice. For example, if sumVectorC is never updated, then it is probably possible to find an EXTREMELY cheap hash (e.g. a cast to unsigned long long plus bit extraction) that is perfect on the sumVectorC values that fit into a double[8]. Then the overhead is a bit extraction plus 1 comparison, versus 3 or 6.

Also, if you have a reasonable bound on sumVectorC.size() (you mentioned 4 or 5, so this assumption seems fine), you could consider using an aggregated array, or even just a boost::array<double> with your own dynamic size, e.g.:

class AggregatedArray : public boost::array<double, 8> {
   size_t _size;
public:
   AggregatedArray() : _size(0) {}
   size_t size() const { return _size; }
   void push_back(double v) { (*this)[_size++] = v; }
   void pop() { --_size; }
   void resize(size_t n) { _size = n; }
};

This gets rid of the extra cache-line access to the separately allocated array data of sumVectorC.

In the case where sumVectorC updates very infrequently, if finding a perfect hash (out of your class of hash algorithms) is relatively cheap, then you can pay that cost with profit whenever sumVectorC changes. These small lookups can be problematic, and algorithmic complexity is frequently irrelevant; it is the constants that dominate. It is an engineering problem, not a theoretical one.

Unless you can guarantee that the small maps are in cache, you can almost be guaranteed that using a std::map will yield approximately 130% worse performance, since pretty much every node in the tree will sit in a separate cache line.

Thus instead of accessing (4×1 + 1×2)/5 = 1.2 cache lines per search (the first 4 elements sit in the first cache line, the 5th in the second), you will access (1 + 2×2 + 2×3)/5 = 9/5 = 1.8 cache lines, plus 1 for the tree itself, for about 2.8 cache lines per search (the 1 being the node at the root, 2 nodes being children of the root, and the last 2 being grandchildren of the root, plus the tree itself).

So I would predict that using a std::map takes 2.8/1.2 = 233% as long for a sumVectorC with 5 entries.

This is what I meant when I said: "It is an engineering problem, not a theoretical one."
