如何在 SAS? 中删除重复记录\观察而不进行排序;
我想知道是否有一种方法可以不排序而取消重复记录?有时,我想保留原始顺序,只想删除重复的记录。
是否可以?
顺便说一句,以下是我对不重复记录的了解,它最终会进行排序..
1.
proc sql;
create table yourdata_nodupe as
select distinct *
From abc;
quit;
2.
proc sort data=YOURDATA nodupkey;
by var1 var2 var3 var4 var5;
run;
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(8)
您可以使用哈希对象来跟踪在传递数据集时看到的值。仅当遇到尚未观察到的键时输出。该输出按照输入数据集中观察数据的顺序进行输出。
以下是使用输入数据集“sashelp.cars”的示例。原始数据按 Make 的字母顺序排列,因此您可以看到输出数据集“nodupes”保持相同的顺序。
You could use a hash object to keep track of which values have been seen as you pass through the data set. Only output when you encounter a key that hasn't been observed yet. This outputs in the order the data was observed in the input data set.
Here is an example using the input data set "sashelp.cars". The original data was in alphabetical order by Make so you can see that the output data set "nodupes" maintains that same order.
我认为简短的答案是否定的,没有,至少没有一种方法不会比基于排序的方法对性能造成更大的影响。
可能在某些特定情况下这是可能的(所有变量都被索引的数据集?一个相对较小的数据集,您可以合理地加载到内存中并在那里使用?)但这不会帮助您使用通用方法。
Chris J 的解决方案可能是获得您想要的结果的最佳方法,但这并不是您实际问题的答案。
I think the short answer is no, there isn't, at least not a way that wouldn't have a much bigger performance hit than a method based on sorting.
There may be specific cases where this is possible (a dataset where all variables are indexed? A relatively small dataset that you could reasonably load into memory and work with there?) but this wouldn't help you with a general method.
Something along the lines of Chris J's solution is probably the best way to get the outcome you're after, but that's not an answer to your actual question.
根据数据集中变量的数量,以下内容可能是实用的:
Depending on the number of variables in your data set, the following might be practical:
请参阅使用说明 37581:如何在不排序的情况下从大型数据集中消除重复观察,http://support.sas.com/kb/37/581.html 。使用说明 37581 展示了如何使用 PROC SUMMARY 来更有效地删除重复项而不使用排序。
Please refer to Usage Note 37581: How can I eliminate duplicate observations from a large data set without sorting, http://support.sas.com/kb/37/581.html . Usage Note 37581 shows how PROC SUMMARY can be used to more efficiently remove duplicates without the use of sorting.
原帖中给出的两个例子并不相同。
如果您只想查找具有公共关键变量的记录,我能想到的另一种解决方案是创建一个仅包含关键变量的数据集,并找出哪些是重复的,然后对原始数据应用一种格式来标记重复记录。如果数据集中存在多个关键变量,则需要创建一个新变量,其中包含所有关键变量值的串联 - 如果需要,则将其转换为字符。
The two examples given in the original post are not identical.
If you are only looking for records having common key variables, another solution I could think of would be to create a dataset with only the key variable(s) and find out which one are duplicates and then apply a format on the original data to flag duplicate records. If more than one key variable is present in the dataset, one would need to create a new variable containing the concatenation of all the key variable values - converted to character if needed.
这是我能想到的最快的方法。它不需要排序。
This is the fastest way I can think of. It requires no sorting.
这不会对数据进行排序,但会删除每个组中的重复项。
This will not sort the data but will remove duplicates within each group.