在 SAS 8.2 中合并/连接 2 个大型数据集的有效方法

发布于 2024-07-13 15:46:02 字数 251 浏览 7 评论 0原文

我尝试了以下选项，但响应时间不可接受 - 创建索引“key”也没有帮助（注意：两个数据集中重复的“keys”）：

data a;
merge b
      c;
by key
if b;
run;

=== OR ===

proc sql;
create a
as select * 
from b
left outer join c
on b.key;
quit;

原文

I have tried the following options with unacceptable response times - creating index 'key' did not help either (NOTE:duplicate'keys'in both datasets):

data a;
merge b
      c;
by key
if b;
run;

=== OR ===

proc sql;
create a
as select * 
from b
left outer join c
on b.key;
quit;

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

情绪失控 2024-07-20 15:46:03

创建密钥是准备加入数据集的最快方法。对它们进行排序可能需要与合并它们一样长的时间，甚至更长，但仍然是一个好主意。

AFHood 的修剪它们的建议是一个很好的建议。您可以通过过程摘要来运行它们吗？是否有任何可以删除的列？其中任何一个都会减少数据集的大小并使合并速度更快。

然而，这些方法都可能不起作用。我经常合并数百万行的文件，这可能需要一段时间。

回复收藏 0 原文

日记撕了你也走了 2024-07-20 15:46:03

您可以尝试 SQL 合并。我不知道它是否会更快地满足您的需求，但我发现 SQL 比常规 SAS 合并效率更高。另外，一旦您意识到可以使用 SQL 做什么，操作数据集就会变得更加容易！

回复收藏 0 原文

谁与争疯 2024-07-20 15:46:03

不要使用数据步骤合并来执行此操作。
如果两个数据集中有重复的键，结果将是错误的。
执行此操作的唯一方法是使用

Proc SQL;   
  Create table newdata    
  as select firsttable.aster, secondtable.aster    
  from table1 as firsttable   
  inner join table2 as secondtable    
  on (firstable.keyfield = secondtable.keyfield);   
quit;

如果您有多个关键字段，则连接顺序应为最不匹配字段第一个到最大匹配字段最后。 SAS 有一个坏习惯，即创建一个包含所有可能匹配项的临时数据集，然后从那里筛选它。可能会耗尽您的临时空间分配并减慢一切速度。

如果您仍然希望使用数据步骤，那么您需要从其中一个数据集中删除重复的键。

Don't use a data step merge to do this.
With duplicate keys in both datasets the result will be wrong.
The only way to do this is with a

Proc SQL;   
  Create table newdata    
  as select firsttable.aster, secondtable.aster    
  from table1 as firsttable   
  inner join table2 as secondtable    
  on (firstable.keyfield = secondtable.keyfield);   
quit;

If you have more than one keyfield the join order should be least match field first to greatest match field last. SAS has a bad habbit of creating a tempory dataset containing all possible matches and the siveing it down from there. Can blow out your tempory space allocation and slow everyting down.

If you still wish to use a DATA Step then you need to get rid of the duplicate keys out of one of the datasets.

回复收藏 0 原文