(There is a lot of information needed to understand my question, but it is already compressed.)
I am trying to implement a class template that allocates and accesses cache-aligned data. This works very well for single elements; however, implementing support for arrays is a problem.
Semantically, the code should produce this memory layout for a single element:
cache_aligned<element_type>* my_el =
new(cache_line_size) cache_aligned<element_type>();
| element | buffer |
Access (so far) looks like this:
*my_el; // returns cache_aligned<element_type>
**my_el; //returns element_type
(*my_el)->member_of_element();
HOWEVER, for an array I'd like to have this:
cache_aligned<element_type>* my_el_array =
new(cache_line_size) cache_aligned<element_type>[N];
| element 0 | buffer | element 1 | buffer | ... | element (N-1) | buffer |
So far I have the following code:
template <typename T>
class cache_aligned {
private:
    T instance;
public:
    cache_aligned()
    {}
    cache_aligned(const T& other)
        : instance(other)
    {}
    static void* operator new (size_t size, uint c_line_size) {
        return c_a_malloc(size, c_line_size);
    }
    static void* operator new[] (size_t size, uint c_line_size) {
        int num_el = (size - sizeof(cache_aligned<T>*))
                     / sizeof(cache_aligned<T>);
        return c_a_array(sizeof(cache_aligned<T>), num_el, c_line_size);
    }
    static void operator delete (void* ptr) {
        free_c_a(ptr);
    }
    T* operator-> () {
        return &instance;
    }
    T& operator * () {
        return instance;
    }
};
and the helper functions for arrays look like this:
void* c_a_array(uint size, ulong num_el, uint c_line_size) {
    void* mem = malloc((size + c_line_size) * num_el + sizeof(void*));
    void** ptr = (void**)((char*)mem + sizeof(void*));
    ptr[-1] = mem; // stash the original pointer for free_c_a
    return ptr;
}

void free_c_a(void* ptr) {
    free(((void**)ptr)[-1]);
}
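The question uses c_a_malloc in operator new without showing it. For completeness, here is one guess at what it might look like (my assumption, not the author's code): the same stash-the-original-pointer scheme as c_a_array, but actually rounding the returned address up to a line boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical sketch of c_a_malloc: over-allocate, round up to the next
// cache-line boundary, and stash the original malloc pointer just before
// the aligned block so it can be recovered on free.
void* c_a_malloc(size_t size, unsigned c_line_size) {
    void* mem = std::malloc(size + c_line_size + sizeof(void*));
    if (!mem) return 0;
    // first usable byte after the hidden pointer slot
    size_t raw = (size_t)mem + sizeof(void*);
    // round up to the next multiple of c_line_size
    size_t aligned = (raw + c_line_size - 1) / c_line_size * c_line_size;
    void** ptr = (void**)aligned;
    ptr[-1] = mem; // remember the block for the matching free
    return ptr;
}

// mirrors the question's free_c_a, local to this sketch
void free_c_a_sketch(void* ptr) {
    std::free(((void**)ptr)[-1]);
}
```

A pointer returned this way can be released with the question's free_c_a as well, since both read the hidden slot at index -1.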
The problem is here: access to the data should work like this:
my_el_array[i]; // returns cache_aligned<element_type>
*(my_el_array[i]); // returns element_type
my_el_array[i]->member_of_element();
My ideas to solve it are:
(1) something similar to this, to overload sizeof operator:
static size_t operator sizeof () {
return sizeof(cache_aligned<T>) + c_line_size;
}
--> not possible, since overloading the sizeof operator is illegal
(2) something like this, to overload the operator [] for the pointer type:
static T& operator [] (uint index, cache_aligned<T>* ptr) {
return ptr + ((sizeof(cache_aligned<T>) + c_line_size) * index);
}
--> not possible in C++, anyway
(3) totally trivial solution:
template <typename T>
class cache_aligned {
private:
    T instance;
    bool buffer[CACHE_LINE_SIZE];
    // CACHE_LINE_SIZE defined as a macro
public:
    // trivial operators and methods ;)
};
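Fleshing out idea (3), a minimal sketch (assuming a 64-byte line, and named cache_aligned_padded here to avoid clashing with the class above): the padding length is computed at compile time from sizeof(T), so each element occupies a whole number of cache lines. If sizeof(T) is already a multiple of the line size this wastes one extra line, but the pad array length never becomes the illegal zero.

```cpp
#include <cassert>
#include <cstddef>

#define CACHE_LINE_SIZE 64 // assumed line size for this sketch

template <typename T>
class cache_aligned_padded {
private:
    T instance;
    // pad so that sizeof(cache_aligned_padded<T>) is a multiple of the
    // line size; the % keeps the array length positive in all cases
    char pad[CACHE_LINE_SIZE - sizeof(T) % CACHE_LINE_SIZE];
public:
    T& operator * ()  { return instance; }
    T* operator-> () { return &instance; }
};
```

With this, plain new[] produces the "element, buffer, element, buffer" layout from the question, provided the array base itself starts on a line boundary.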
--> I don't know whether this is reliable; I'm using gcc 4.5.1 on Linux ...
(4) Replacing T instance; with T* instance_ptr; in the class template and using operator [] to calculate the position of each element, like this:
| pointer-to-instance | ----> | element 0 | buffer | ... | element (N-1) | buffer |
This is not the intended semantics, since the indirection through the class template instance becomes a bottleneck when calculating element addresses.
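One possible workaround (my sketch, not from the question): keep the stride arithmetic of idea (2), but put it in a thin array wrapper whose operator[] computes each element's slot address. The elements live directly in one over-allocated buffer, so unlike idea (4) there is no pointer-to-instance indirection per element.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

template <typename T>
class cache_aligned_array {
private:
    char*  raw;    // block returned by malloc (kept for free)
    char*  mem;    // first line-aligned byte inside raw
    size_t stride; // sizeof(T) rounded up to a multiple of the line size
    size_t n;
    // copying is deliberately unsupported in this sketch
    cache_aligned_array(const cache_aligned_array&);
    cache_aligned_array& operator=(const cache_aligned_array&);
public:
    cache_aligned_array(size_t num_el, size_t line_size)
        : stride((sizeof(T) + line_size - 1) / line_size * line_size),
          n(num_el)
    {
        raw = (char*)std::malloc(stride * num_el + line_size);
        // round the base address up to the next line boundary
        mem = (char*)(((size_t)raw + line_size - 1)
                      / line_size * line_size);
        for (size_t i = 0; i < n; ++i)
            new (mem + i * stride) T(); // placement-construct each slot
    }
    ~cache_aligned_array() {
        for (size_t i = 0; i < n; ++i)
            ((T*)(mem + i * stride))->~T();
        std::free(raw);
    }
    T& operator[] (size_t i) { return *(T*)(mem + i * stride); }
};
```

Here my_el_array[i] yields element_type directly; the extra cache_aligned<...> level from the question is folded into the wrapper.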
Thanks for reading! I don't know how to shorten the problem. It would be great if you can help! Any workaround would help a lot.
I know alignment support is an extension in C++0x; however, it is not yet available in gcc.
Greetz, sema
When c_line_size is a compile-time integral constant, it is of course better to pad cache_aligned with a char array sized according to sizeof T.
You can also check whether two T's fit into one cache line and lower the alignment requirement accordingly.
Do not expect miracles from such an optimization. I think a 2x performance gain for some algorithms is the ceiling you can squeeze out of avoiding cache-line splits.
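A minimal sketch of this suggestion, assuming a 64-byte line: pick the smallest power-of-two slot that holds a T. Since every such slot size divides the line size, elements placed at slot boundaries never straddle a line, yet small types share a line instead of each wasting most of one.

```cpp
#include <cassert>
#include <cstddef>

// Compile-time slot size for T: the smallest power-of-two fraction of a
// cache line that holds a T, or a whole number of lines for large T.
template <typename T, size_t LINE = 64>
struct slot_size {
    static const size_t value =
        sizeof(T) <= LINE / 8 ? LINE / 8 :
        sizeof(T) <= LINE / 4 ? LINE / 4 :
        sizeof(T) <= LINE / 2 ? LINE / 2 :
        ((sizeof(T) + LINE - 1) / LINE) * LINE; // one or more full lines
};
```

The padding char array in cache_aligned would then have length slot_size<T>::value - sizeof(T) instead of a full line's worth, trading isolation for density.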