在 SAS 8.2 中合并/连接 2 个大型数据集的有效方法
我尝试了以下选项,但响应时间不可接受 - 创建索引“key”也没有帮助(注意:两个数据集中重复的“keys”):
data a;
merge b
c;
by key
if b;
run;
=== OR ===
proc sql;
create a
as select *
from b
left outer join c
on b.key;
quit;
I have tried the following options with unacceptable response times - creating index 'key' did not help either (NOTE:duplicate'keys'in both datasets):
data a;
merge b
c;
by key
if b;
run;
=== OR ===
proc sql;
create a
as select *
from b
left outer join c
on b.key;
quit;
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
创建密钥是准备加入数据集的最快方法。 对它们进行排序可能需要与合并它们一样长的时间,甚至更长,但仍然是一个好主意。
AFHood 的修剪它们的建议是一个很好的建议。 您可以通过过程摘要来运行它们吗? 是否有任何可以删除的列? 其中任何一个都会减少数据集的大小并使合并速度更快。
然而,这些方法都可能不起作用。 我经常合并数百万行的文件,这可能需要一段时间。
Creating a key is the fastest way to get the datasets ready for joining. Sorting them can take just as long as merging them, if not longer, but is still a good idea.
AFHood's suggestion of trimming them is a good. Can you possibly run them through a PROC SUMMARY? Are there any columns you can drop? Any of these will reduce the size of your dataset and make the merging faster.
None of these methods may work however. I routinely merge files of several million rows and it can take a while.
您可以尝试 SQL 合并。 我不知道它是否会更快地满足您的需求,但我发现 SQL 比常规 SAS 合并效率更高。 另外,一旦您意识到可以使用 SQL 做什么,操作数据集就会变得更加容易!
You might try the SQL merge. I don't know if it would be faster for your needs but I've found SQL to be much more efficient than a regular SAS merge. Plus, once you realize what you can do with SQL, manipulating datasets becomes easier!
不要使用数据步骤合并来执行此操作。
如果两个数据集中有重复的键,结果将是错误的。
执行此操作的唯一方法是使用
如果您有多个关键字段,则连接顺序应为最不匹配字段第一个到最大匹配字段最后。 SAS 有一个坏习惯,即创建一个包含所有可能匹配项的临时数据集,然后从那里筛选它。 可能会耗尽您的临时空间分配并减慢一切速度。
如果您仍然希望使用数据步骤,那么您需要从其中一个数据集中删除重复的键。
Don't use a data step merge to do this.
With duplicate keys in both datasets the result will be wrong.
The only way to do this is with a
If you have more than one keyfield the join order should be least match field first to greatest match field last. SAS has a bad habbit of creating a tempory dataset containing all possible matches and the siveing it down from there. Can blow out your tempory space allocation and slow everyting down.
If you still wish to use a DATA Step then you need to get rid of the duplicate keys out of one of the datasets.
在合并两个数据集之前,您应该首先对它们进行排序。 这就是性能的体现。 当您必须扫描整个表以获得结果时使用索引通常比预排序数据集并合并它们要慢。
You should first sort the two datasets before merging them. This is what will give the performance. using an index when you have to scan the whole table to have a result is usually slower then presorting the datasets and merging them.
请务必尽可能地修剪数据集。 在数据步骤或 proc sql 之前对数据集进行排序。 另外,如果重要的话,我也不是 100%,但 ANSI SQL 应该是 proc sql; 创建 a as select * from b left external join c on b.key=C.KEY; 退出;
Be sure to trim your dataset as much as possible. Sort your dataset before the data step or proc sql. Also, I'm not 100% if it matters, but ANSI SQL would be
proc sql; create a as select * from b left outer join c on b.key=C.KEY; quit;