无论如何,在决策树中与2D数组打交道时是否有避免使用“ memcpy”?
我正在尝试编码二进制决策树。以下代码是一个可作为整个算法的一部分的数据分配的工作示例,尽管它具有一些 error 由Valgrind检测到。
您可能已经知道,这棵树由节点组成。在每个节点中,它通过分裂条件携带所有合格的观测值(行)。继续继续,分裂也依赖于这些行。
并且可以通过所有变量和值从贪婪的搜索中找到最佳的分裂条件,这意味着该数据分割步骤可以执行数百次以找到最佳分裂(通过Gini Index在分类问题中评估)。我认为使用memcpy
时可能会很昂贵。最糟糕的部分是,每次分开尝试后,我必须释放复制行的内存。
因此,我在想是否有一种方法可以避免使用memcpy
从字面上复制整个数据集,而只需在每个分割中引用原始数据集的合格行。这样,不需要过多的memcpy
和免费
。
问题:要针对以下示例特定,如何使用一个指针记录合格行的地址,以便不需要memcpy
,并允许(易于或方便)执行合格行贪婪的搜索以进行最佳分裂。
让我知道您的问题是否对您不清楚。非常感谢您的时间。
欢迎任何帮助和建议。
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void printArray(double array[], unsigned int size) {
for (unsigned int i = 0; i < size; ++i) {
printf("%.3f ", array[i]);
}
printf("\n");
}
double **alloc_2dArray(unsigned int row, unsigned int col)
{
double **Array = (double **) malloc(row * sizeof(double *));
for (unsigned int i = 0; i < row; i++) {
Array[i] = (double *) malloc(col * sizeof(double));
}
return Array;
}
void dealloc_2dArray(double **Array, unsigned int row, unsigned int col)
{
for (unsigned int i = 0; i < row; i++)
{
free(Array[i]);
}
free(Array);
}
int main(){
double **data = alloc_2dArray(4,4);
double delta[4] = {1,0,1,0};
double **dataL = alloc_2dArray(4,4);
double **dataR = alloc_2dArray(4,4);
int i,j;
int k=0;
//data initialization
for(i=0; i<4; i++){
for(j=0; j<4; j++){
data[i][j] = k++;
}
}
int l=0, r=0;
for(i=0; i<4; i++){
if((int)delta[i] ==0){
memcpy(dataL[l++], data[i], 4*sizeof(double));
}else{
memcpy(dataR[r++], data[i], 4*sizeof(double));
}
}
for(i = l; i<4; i++){
free(dataL[i]);
}
printf("l and r are %d %d\n", l, r);
dataL = realloc(dataL, l*sizeof(double*));
for(i = r; i<4; i++){
free(dataR[i]);
}
dataR = realloc(dataR, r*sizeof(double*));
for(i=0; i<4; i++){
if(dataL[i] != NULL){
printArray(dataL[i],4);
}
if(dataR[i] != NULL){
printArray(dataR[i],4);
}
}
dealloc_2dArray(data, 4,4);
dealloc_2dArray(dataL, l, 4);
dealloc_2dArray(dataR, r, 4);
return 0;
}
I'm trying to code binary decision tree. The below code is a working example for data splitting as a part of the whole algorithm although it has some error detected by Valgrind.
As you may already know, the tree is composed of nodes. In every node, it carries all the qualified observations(rows) by the splitting condition. Continue on, the splitting relies on those rows as well.
And the best split condition can be found from a greedy search through all the variables and values, which means this data splitting step could execute hundreds of times to find the optimal splitting(evaluate by Gini index in classification problem). I thought it could be expensive when using memcpy
. The worst part is I have to free the memory for the copied rows after each splitting attempt.
So, I was thinking if there is a way to avoid using memcpy
to copy the entire dataset literally, instead, just reference the qualified rows of the original dataset in each splitting. In this way, no excessive memcpy
and free
are needed.
Question: To be specific for the below example, how to use one pointer to record the address of the qualified rows so that no memcpy
is needed and allow (easy or handy) access to qualified rows when performing a greedy search for optimal splitting.
Let me know if the question is not clear to you. Thanks so much for your time.
Any help and suggestion are welcome.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void printArray(double array[], unsigned int size) {
for (unsigned int i = 0; i < size; ++i) {
printf("%.3f ", array[i]);
}
printf("\n");
}
double **alloc_2dArray(unsigned int row, unsigned int col)
{
double **Array = (double **) malloc(row * sizeof(double *));
for (unsigned int i = 0; i < row; i++) {
Array[i] = (double *) malloc(col * sizeof(double));
}
return Array;
}
void dealloc_2dArray(double **Array, unsigned int row, unsigned int col)
{
for (unsigned int i = 0; i < row; i++)
{
free(Array[i]);
}
free(Array);
}
int main(){
double **data = alloc_2dArray(4,4);
double delta[4] = {1,0,1,0};
double **dataL = alloc_2dArray(4,4);
double **dataR = alloc_2dArray(4,4);
int i,j;
int k=0;
//data initialization
for(i=0; i<4; i++){
for(j=0; j<4; j++){
data[i][j] = k++;
}
}
int l=0, r=0;
for(i=0; i<4; i++){
if((int)delta[i] ==0){
memcpy(dataL[l++], data[i], 4*sizeof(double));
}else{
memcpy(dataR[r++], data[i], 4*sizeof(double));
}
}
for(i = l; i<4; i++){
free(dataL[i]);
}
printf("l and r are %d %d\n", l, r);
dataL = realloc(dataL, l*sizeof(double*));
for(i = r; i<4; i++){
free(dataR[i]);
}
dataR = realloc(dataR, r*sizeof(double*));
for(i=0; i<4; i++){
if(dataL[i] != NULL){
printArray(dataL[i],4);
}
if(dataR[i] != NULL){
printArray(dataR[i],4);
}
}
dealloc_2dArray(data, 4,4);
dealloc_2dArray(dataL, l, 4);
dealloc_2dArray(dataR, r, 4);
return 0;
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我 think 您询问您是否只能将原始数据行指针保留在拆分矩阵中。答案是肯定的,你可以。但是,所有权发挥了作用。
对于下面的示例(没有
memcpy
操作)datal
和datar
datar 指针床是直接分配的,而没有分配的数据行的实际行给他们。这些行是从原始数据矩阵中“借”的。输出
关于这些行指针的实际所有权。只有您知道那里有什么合适的。如果原始
数据
矩阵要继续拥有它们,则必须注意确保data
及其行datal
和Datar
仍在使用中。同样,如果datar
和datal
是所有权,则在释放data
时,您不应释放这些行data
),并具有datal
和datar
free'ing不仅要释放其基本指针床,还可以释放行。无论如何,你去了。
I think you're asking whether you can just retain the original data row pointers in the split matrices. The answer is yes, you can. Ownership comes into play, however.
For the example below (which has no
memcpy
operations) both thedataL
anddataR
pointer beds are direct-allocated with no actual rows of data assigned to them. The rows are "borrowed" from the original data matrix.Output
Regarding actual ownership of those row pointers. Only you know what is appropriate there. If the original
data
matrix is to continue to own them, you must take care to ensure thatdata
and its rows are not destroyed whiledataL
anddataR
are still in use. Likewise, ifdataR
anddataL
are to take ownership, then you should not free those rows when freeingdata
(but still free the base pointer bed ofdata
), and havedataL
anddataR
free'ing be responsible for freeing not only their base pointer beds, but the rows as well.Anyway, there you go.