如何在 SAS? 中删除重复记录\观察而不进行排序;

发布于 2024-11-02 03:39:17 字数 347 浏览 1 评论 0 原文

我想知道是否有一种方法可以不排序而取消重复记录?有时,我想保留原始顺序,只想删除重复的记录。

是否可以?

顺便说一句,以下是我对不重复记录的了解,它最终会进行排序..

1.

proc sql;
   create table yourdata_nodupe as
   select distinct *
   From abc;
quit;

2.

proc sort data=YOURDATA nodupkey;    
    by var1 var2 var3 var4 var5;    
run;

I wonder if there is a way to unduplicate records WITHOUT sorting?Sometimes, I want to keep original order and just want to remove duplicated records.

Is it possible?

BTW, below are what I know regarding unduplicating records, which does sorting in the end..

1.

proc sql;
   create table yourdata_nodupe as
   select distinct *
   From abc;
quit;

2.

proc sort data=YOURDATA nodupkey;    
    by var1 var2 var3 var4 var5;    
run;

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(8

韬韬不绝 2024-11-09 03:39:17

您可以使用哈希对象来跟踪在传递数据集时看到的值。仅当遇到尚未观察到的键时输出。该输出按照输入数据集中观察数据的顺序进行输出。

以下是使用输入数据集“sashelp.cars”的示例。原始数据按 Make 的字母顺序排列,因此您可以看到输出数据集“nodupes”保持相同的顺序。

data nodupes (drop=rc);;
  length Make $13.;

  declare hash found_keys();
    found_keys.definekey('Make');
    found_keys.definedone();

  do while (not done);
    set sashelp.cars end=done;
    rc=found_keys.check();
    if rc^=0 then do;      
      rc=found_keys.add(); 
      output;              
    end;
  end;
  stop;
run;

proc print data=nodupes;run;

You could use a hash object to keep track of which values have been seen as you pass through the data set. Only output when you encounter a key that hasn't been observed yet. This outputs in the order the data was observed in the input data set.

Here is an example using the input data set "sashelp.cars". The original data was in alphabetical order by Make so you can see that the output data set "nodupes" maintains that same order.

data nodupes (drop=rc);;
  length Make $13.;

  declare hash found_keys();
    found_keys.definekey('Make');
    found_keys.definedone();

  do while (not done);
    set sashelp.cars end=done;
    rc=found_keys.check();
    if rc^=0 then do;      
      rc=found_keys.add(); 
      output;              
    end;
  end;
  stop;
run;

proc print data=nodupes;run;
请止步禁区 2024-11-09 03:39:17
/* Give each record in the original dataset and row number */
data with_id ;
  set mydata ;
  _id = _n_ ;
run ;

/* Remove dupes */
proc sort data=with_id nodupkey ;
  by var1 var2 var3 ;
run ;

/* Sort back into original order */
proc sort data=with_id ;
  by _id ;
run ;

/* Give each record in the original dataset and row number */
data with_id ;
  set mydata ;
  _id = _n_ ;
run ;

/* Remove dupes */
proc sort data=with_id nodupkey ;
  by var1 var2 var3 ;
run ;

/* Sort back into original order */
proc sort data=with_id ;
  by _id ;
run ;

生来就爱笑 2024-11-09 03:39:17

我认为简短的答案是否定的,没有,至少没有一种方法不会比基于排序的方法对性能造成更大的影响。

可能在某些特定情况下这是可能的(所有变量都被索引的数据集?一个相对较小的数据集,您可以合理地加载到内存中并在那里使用?)但这不会帮助您使用通用方法。

Chris J 的解决方案可能是获得您想要的结果的最佳方法,但这并不是您实际问题的答案。

I think the short answer is no, there isn't, at least not a way that wouldn't have a much bigger performance hit than a method based on sorting.

There may be specific cases where this is possible (a dataset where all variables are indexed? A relatively small dataset that you could reasonably load into memory and work with there?) but this wouldn't help you with a general method.

Something along the lines of Chris J's solution is probably the best way to get the outcome you're after, but that's not an answer to your actual question.

哭泣的笑容 2024-11-09 03:39:17

根据数据集中变量的数量,以下内容可能是实用的:

data abc_nodup;
   set abc;
   retain _var1 _var2 _var3 _var4;
   if _n_ eq 1 then output;
   else do;
      if (var1 eq _var1) and (var2 eq _var2) and
         (var3 eq _var3) and (var4 eq _var4)
         then delete;
      else output;
   end;
   _var1 = var1;
   _var2 = var2;
   _var3 = var3;
   _var4 = var4;
   drop _var:;
run;

Depending on the number of variables in your data set, the following might be practical:

data abc_nodup;
   set abc;
   retain _var1 _var2 _var3 _var4;
   if _n_ eq 1 then output;
   else do;
      if (var1 eq _var1) and (var2 eq _var2) and
         (var3 eq _var3) and (var4 eq _var4)
         then delete;
      else output;
   end;
   _var1 = var1;
   _var2 = var2;
   _var3 = var3;
   _var4 = var4;
   drop _var:;
run;
碍人泪离人颜 2024-11-09 03:39:17

请参阅使用说明 37581:如何在不排序的情况下从大型数据集中消除重复观察http://support.sas.com/kb/37/581.html 。使用说明 37581 展示了如何使用 PROC SUMMARY 来更有效地删除重复项而不使用排序。

Please refer to Usage Note 37581: How can I eliminate duplicate observations from a large data set without sorting, http://support.sas.com/kb/37/581.html . Usage Note 37581 shows how PROC SUMMARY can be used to more efficiently remove duplicates without the use of sorting.

但可醉心 2024-11-09 03:39:17

原帖中给出的两个例子并不相同。

  • proc sql中的distinct仅删除完全相同的行。proc
  • sort中的nodupkey删除关键变量相同的任何行(即使其他变量不相同)。您需要选项 noduprecs 来删除完全相同的行。

如果您只想查找具有公共关键变量的记录,我能想到的另一种解决方案是创建一个仅包含关键变量的数据集,并找出哪些是重复的,然后对原始数据应用一种格式来标记重复记录。如果数据集中存在多个关键变量,则需要创建一个新变量,其中包含所有关键变量值的串联 - 如果需要,则将其转换为字符。

The two examples given in the original post are not identical.

  • distinct in proc sql only removes lines which are fully identical
  • nodupkey in proc sort removes any line where key variables are identical (even if other variables are not identical). You need the option noduprecs to remove fully identical lines.

If you are only looking for records having common key variables, another solution I could think of would be to create a dataset with only the key variable(s) and find out which one are duplicates and then apply a format on the original data to flag duplicate records. If more than one key variable is present in the dataset, one would need to create a new variable containing the concatenation of all the key variable values - converted to character if needed.

戴着白色围巾的女孩 2024-11-09 03:39:17

这是我能想到的最快的方法。它不需要排序。

data output_data_name;
    set input_data_name (
        sortedby = person_id stay
        keep =
            person_id
            stay
            ... more variables ...);
    by person_id stay;
    if first.stay > 0 then output;
run;

This is the fastest way I can think of. It requires no sorting.

data output_data_name;
    set input_data_name (
        sortedby = person_id stay
        keep =
            person_id
            stay
            ... more variables ...);
    by person_id stay;
    if first.stay > 0 then output;
run;
趴在窗边数星星i 2024-11-09 03:39:17
data output;
set yourdata;
by var notsorted;
if first.var then output;
run;

这不会对数据进行排序,但会删除每个组中的重复项。

data output;
set yourdata;
by var notsorted;
if first.var then output;
run;

This will not sort the data but will remove duplicates within each group.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文