sorting sas duplicates

How to remove duplicate records/observations in SAS without sorting?

Posted on 2024-11-02 03:39:17

Is there a way to remove duplicate records WITHOUT sorting? Sometimes I want to keep the original order and just remove the duplicated records.

Is it possible?

BTW, below is what I know about removing duplicate records; both approaches end up sorting the data.

proc sql;
   create table yourdata_nodupe as
   select distinct *
   from abc;
quit;

proc sort data=YOURDATA nodupkey;
    by var1 var2 var3 var4 var5;
run;

韬韬不绝 2024-11-09 03:39:17

You could use a hash object to keep track of which values have been seen as you pass through the data set. Only output when you encounter a key that hasn't been observed yet. This outputs in the order the data was observed in the input data set.

Here is an example using the input data set "sashelp.cars". The original data was in alphabetical order by Make so you can see that the output data set "nodupes" maintains that same order.

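data nodupes (drop=rc);
  length Make $13;

  /* Hash table that records every Make value seen so far */
  declare hash found_keys();
    found_keys.definekey('Make');
    found_keys.definedone();

  do while (not done);
    set sashelp.cars end=done;
    rc=found_keys.check();   /* 0 means this Make is already in the hash */
    if rc^=0 then do;
      rc=found_keys.add();   /* first occurrence: remember the key */
      output;                /* and keep the row */
    end;
  end;
  stop;
run;

proc print data=nodupes; run;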
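
The same idea extends to several key variables. Below is a minimal sketch, assuming the question's data set `abc` and its key variables `var1`, `var2` and `var3`; the output name `abc_nodupe` and the hash name `seen` are made up for illustration. The hash object is held in memory, so this only suits cases where the distinct key combinations fit there.

```sas
data abc_nodupe (drop=rc);
  set abc;

  /* Build the hash once, on the first pass of the implicit DATA step loop */
  if _n_ = 1 then do;
    declare hash seen();
    seen.definekey('var1', 'var2', 'var3');
    seen.definedone();
  end;

  /* check() returns a nonzero code when this key combination is new */
  rc = seen.check();
  if rc ne 0 then do;
    rc = seen.add();   /* remember the combination */
    output;            /* keep the first row that carries it */
  end;
run;
```

Rows are written in the order they are read, so the first occurrence of each key combination stays in its original relative position.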

请止步禁区 2024-11-09 03:39:17

/* Give each record in the original dataset a row number */
data with_id ;
  set mydata ;
  _id = _n_ ;
run ;

/* Remove dupes */
proc sort data=with_id nodupkey ;
  by var1 var2 var3 ;
run ;

/* Sort back into original order */
proc sort data=with_id ;
  by _id ;
run ;

生来就爱笑 2024-11-09 03:39:17

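I think the short answer is no, there isn't, or at least not a way that wouldn't carry a bigger performance hit than a sort-based method.

There may be specific cases where this is possible (a dataset where all variables are indexed? A relatively small dataset that you could reasonably load into memory and work with there?), but that won't help you with a general method.

Chris J's solution is probably the best way to get the result you're after, but it isn't an answer to your actual question.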

哭泣的笑容 2024-11-09 03:39:17

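Depending on the number of variables in your data set, the following might be practical. Note that it compares each row only with the row immediately before it, so it removes consecutive duplicates only: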

data abc_nodup;
   set abc;
   retain _var1 _var2 _var3 _var4;
   if _n_ eq 1 then output;
   else do;
      if (var1 eq _var1) and (var2 eq _var2) and
         (var3 eq _var3) and (var4 eq _var4)
         then delete;
      else output;
   end;
   _var1 = var1;
   _var2 = var2;
   _var3 = var3;
   _var4 = var4;
   drop _var:;
run;

碍人泪离人颜 2024-11-09 03:39:17

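See Usage Note 37581: How can I eliminate duplicate observations from a large data set without sorting, http://support.sas.com/kb/37/581.html . Usage Note 37581 shows how PROC SUMMARY can be used to remove duplicates more efficiently, without using a sort.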
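
A rough sketch of the PROC SUMMARY idea follows. It is not copied from the usage note: it assumes the question's data set `abc` with key variables `var1`, `var2` and `var3`, and the output name `abc_nodupe` is made up for illustration.

```sas
/* NWAY keeps only the rows for the full var1*var2*var3 combination;  */
/* MISSING treats missing values as valid class levels, so rows with  */
/* a missing key are not silently dropped.                            */
proc summary data=abc nway missing;
  class var1 var2 var3;
  output out=abc_nodupe (drop=_type_ _freq_);
run;
```

The result has one row per distinct combination of the CLASS variables and contains only those variables. It comes out ordered by the CLASS variables rather than in the original row order, so on its own this avoids the sort step but does not preserve the input order.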

但可醉心 2024-11-09 03:39:17

The two examples given in the original post are not equivalent.

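distinct in proc sql removes only rows that are completely identical. nodupkey in proc sort removes any row whose key variables match an earlier row (even if the other variables differ). You need the noduprecs option to remove completely identical rows.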
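
To make the difference concrete, here is a small sketch on a made-up four-row data set; the names `demo`, `by_key` and `by_rec` are hypothetical.

```sas
data demo;
  input id value;
  datalines;
1 10
1 10
1 20
2 30
;
run;

/* NODUPKEY: one row per BY value, so 2 rows remain (id=1 and id=2) */
proc sort data=demo out=by_key nodupkey;
  by id;
run;

/* NODUPRECS: only fully identical rows are dropped, so 3 rows remain. */
/* Sorting by every variable puts identical rows next to each other,   */
/* which matters because NODUPRECS compares adjacent observations.     */
proc sort data=demo out=by_rec noduprecs;
  by id value;
run;
```

With NODUPKEY the row `1 20` is lost because it shares `id=1` with an earlier row, while NODUPRECS keeps it.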

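If you only want to find records that share the same key variables, another solution I can think of is to create a data set containing only the key variables, work out which ones are duplicated, and then apply a format to the original data to flag the duplicate records. If the data set has more than one key variable, you would need to create a new variable holding the concatenation of all the key variable values, converted to character if necessary.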

戴着白色围巾的女孩 2024-11-09 03:39:17

This is the fastest way I can think of. It requires no sorting step, as long as the input is already in person_id, stay order (the BY statement depends on that).

data output_data_name;
    set input_data_name (
        sortedby = person_id stay
        keep =
            person_id
            stay
            ... more variables ...);
    by person_id stay;
    if first.stay > 0 then output;
run;

趴在窗边数星星i 2024-11-09 03:39:17

data output;
  set yourdata;
  by var notsorted;
  if first.var then output;
run;

This will not sort the data but will remove duplicates within each group.
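
Here a "group" is a run of consecutive rows with the same value, so a value that comes back later is kept again. A small sketch with made-up data shows this:

```sas
data yourdata;
  input var $;
  datalines;
A
A
B
A
;
run;

data output;
  set yourdata;
  by var notsorted;
  if first.var then output;
run;
/* output holds A, B, A: the second A is dropped, but the last A starts a new group */
```

If duplicates can be separated by other values, this approach alone will not remove them all.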
