如何检测数据集中的所有空列并删除\删除它们?

发布于 2024-10-28 03:38:55 字数 783 浏览 1 评论 0原文

正如标题中所建议的,我想删除所有空列\变量(其中所有记录均为空或等于 null 或“”),以减少以后执行的时间成本。

详细场景:

我有一个包含 1000 列的 dataset(),其中一些\很多是空的。现在我想创建一个新的数据集,其中需要在先前数据集的某些条件下添加列。

data new;

   set old;

   if oldcol1 ne "" then newcol1='<a>'||strip(oldcol1)||'</a>';

   end;

   if oldcol2 ne "" then newcol2='<a>'||strip(oldcol2)||'</a>';

   end;

   ...

   ...;

   drop oldcol1 oldcol2.....oldcol1000;
   run;

由于以下原因,执行需要相当长的时间:

  1. 旧列的数量巨大

  2. 事实上,我需要在另一个数据集中执行循环来设置之后的数字oldcol

ColNumber

 1

 2

 3

之后的数字...

1000

所以你可以想象,搜索、查找、设置值要执行多少次。

因此,我能想到的减少时间成本的一种方法是首先删除所有空列。但任何有关优化算法的意见也受到高度欢迎。

谢谢

As suggested in the title, I'd like to drop all empty columns\variables(where all records are empty or equal null or ""), so as to reduce time cost in later execution.

Detailed scenario:

I have a dataset() with 1000 columns, some\lots of which are empty. Now I want to create a new dataset in which I need to add columns under some conditions of previous dataset.

data new;

   set old;

   if oldcol1 ne "" then newcol1='<a>'||strip(oldcol1)||'</a>';

   end;

   if oldcol2 ne "" then newcol2='<a>'||strip(oldcol2)||'</a>';

   end;

   ...

   ...;

   drop oldcol1 oldcol2.....oldcol1000;
   run;

It takes quite some time to execute given the following reason:

  1. number of old columns is huge

  2. in fact I need to do a loop in another dataset to set the number after oldcol

ColNumber

 1

 2

 3

...

1000

So you can imagine how many times to be executed in terms of searching, finding and setting values.

Hence one way I could think of to reduce time cost is drop all empty columns first. But any inputs regarding optimizing the algorithm is highly welcomed as well.

Thanks

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

野味少女 2024-11-04 03:38:55

下面是一个通用宏,您可以使用它来生成源数据集中的空列列表,然后可以将其传递给 drop 语句。它使用 proc 格式和 proc freq,因此速度相对较快。

%macro findmiss(ds,macvar);
%local noteopt;
%let noteopt=%sysfunc(getoption(notes));
option nonotes;
*ds is the data set to parse for missing values;
*macvar is the macro variable that will store the list of empty columns;
%global &macvar; 
proc format;
  value nmis  .-.z =' ' other='1';
  value $nmis ' '=' ' other='1';
run;
ods listing close;
ods output OneWayFreqs=OneValue(
  where=(frequency=cumfrequency 
  AND CumPercent=100));

proc freq data=&ds;
  table _All_ / Missing ;
  format _numeric_ nmis. 
        _character_ $nmis.;
  run;
ods listing;
data missing(keep=var);
  length var $32.;
  set OneValue end=eof;
    if percent eq 100 AND sum(of F_:) < 1 ;
    var = scan(Table,-1,' ');
run;
proc sql noprint;
  select var into: &macvar separated by " "
  from missing;quit;
option ¬eopt.;
%mend;

以下是您可以如何使用它:

%findmiss(old,droplist); /*generate the list of empty columns */
data new;
  set old(drop=&droplist);
run;

Here's a generic macro that you can use to generate a list of the empty columns in the source data set, which you can then pass to a drop statement. It uses proc format and proc freq so it is relatively fast.

%macro findmiss(ds,macvar);
%local noteopt;
%let noteopt=%sysfunc(getoption(notes));
option nonotes;
*ds is the data set to parse for missing values;
*macvar is the macro variable that will store the list of empty columns;
%global &macvar; 
proc format;
  value nmis  .-.z =' ' other='1';
  value $nmis ' '=' ' other='1';
run;
ods listing close;
ods output OneWayFreqs=OneValue(
  where=(frequency=cumfrequency 
  AND CumPercent=100));

proc freq data=&ds;
  table _All_ / Missing ;
  format _numeric_ nmis. 
        _character_ $nmis.;
  run;
ods listing;
data missing(keep=var);
  length var $32.;
  set OneValue end=eof;
    if percent eq 100 AND sum(of F_:) < 1 ;
    var = scan(Table,-1,' ');
run;
proc sql noprint;
  select var into: &macvar separated by " "
  from missing;quit;
option ¬eopt.;
%mend;

Here is how you might use it:

%findmiss(old,droplist); /*generate the list of empty columns */
data new;
  set old(drop=&droplist);
run;
七颜 2024-11-04 03:38:55

我同意 proc 转置是一个好主意:

proc transpose data=old out=temp; 
var _ALL_;
run;

data _NULL_;
set temp end=eof;
    array cols {*} COL: ;
do i = 1 to dim(cols);
    cols[i]=ifn((strip(cols[i])=" " or strip(cols[i])="."),0,1);
end;
if sum(of COL:)=0 then 
call symput("dropvars", catx(" ",symget("dropvars"),_NAME_));
run;

    data new; set old (drop=&dropvars); run;

I agree that proc transpose is a good idea:

proc transpose data=old out=temp; 
var _ALL_;
run;

data _NULL_;
set temp end=eof;
    array cols {*} COL: ;
do i = 1 to dim(cols);
    cols[i]=ifn((strip(cols[i])=" " or strip(cols[i])="."),0,1);
end;
if sum(of COL:)=0 then 
call symput("dropvars", catx(" ",symget("dropvars"),_NAME_));
run;

    data new; set old (drop=&dropvars); run;
·深蓝 2024-11-04 03:38:55

像这样的东西吗?

data work.temp1;
  attrib idcol length=8;
  set work.old;

  idcol=_n_;
run;

proc transpose data=work.temp1 out=work.temp2 name=varname;
  var oldcol1-oldcol1000;
  by idcol;
run;

proc sql;
  create table work.temp3 as 
    select distinct varname from work.temp2 where not missing(col1);
quit;

data _null_;
  set work.temp3 end=lastrec;

  attrib nvarname length=$32;

  if _n_=1 then do;
    call execute('data work.new;');
    call execute('set work.old;');
  end;

  nvarname = 'newcol' || strip(input(substr(varname,4),4.));
  call execute('attrib ' || strip(nvarname) || ' length=$250;');

  call execute(strip(nvarname) || '= "<a>" || strip(' || strip(varname) || ') || "</a>";' );

  if lastrec then do;
    call execute('drop oldcol1-oldcol1000;');
    call execute('run;');
  end;
run;

Something like this?

data work.temp1;
  attrib idcol length=8;
  set work.old;

  idcol=_n_;
run;

proc transpose data=work.temp1 out=work.temp2 name=varname;
  var oldcol1-oldcol1000;
  by idcol;
run;

proc sql;
  create table work.temp3 as 
    select distinct varname from work.temp2 where not missing(col1);
quit;

data _null_;
  set work.temp3 end=lastrec;

  attrib nvarname length=$32;

  if _n_=1 then do;
    call execute('data work.new;');
    call execute('set work.old;');
  end;

  nvarname = 'newcol' || strip(input(substr(varname,4),4.));
  call execute('attrib ' || strip(nvarname) || ' length=$250;');

  call execute(strip(nvarname) || '= "<a>" || strip(' || strip(varname) || ') || "</a>";' );

  if lastrec then do;
    call execute('drop oldcol1-oldcol1000;');
    call execute('run;');
  end;
run;
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文