SAS Proc SQL 合并时是否使用索引
考虑以下(诚然很长)的示例。
示例代码创建两个数据集,数据一具有“关键”变量 i、j、k,数据二具有关键变量 j、k 和“值”变量 x。我想尽可能有效地合并这两个数据集。两个数据集都根据 j 和 k 进行索引:不需要第一个数据的索引,但无论如何它都在那里。
Proc SQL 不使用数据二中的索引,我想如果数据位于关系数据库中就会出现这种情况。这只是我必须接受的查询优化器的限制吗?
编辑:这个问题的答案是肯定的,SAS可以使用索引来优化PROC SQL连接。在以下示例中,数据集的相对大小很重要:如果修改代码以使数据二变得相对大于数据一,则将使用索引。数据集是否排序并不重要。
* Just to control the size of the data;
%let j_max=10000;
* Create data sets;
data one;
do i=1 to 3;
do j=1 to &j_max;
do k=1 to 4;
if ranuni(0)<0.9 then output;
end;
end;
end;
run;
data two;
do j=1 to &j_max;
do k=1 to 4;
x=ranuni(0);
if ranuni(0)<0.9 then output;
end;
end;
run;
* Create indices;
proc datasets library=work nolist;
modify one;
index create idx_j_k=(j k);
modify two;
index create idx_j_k=(j k) / unique;
run;quit;
* Test the use of an index for the other data set:
* Log should display "INFO: Index idx_j_k selected for WHERE clause optimization.";
options msglevel=i;
data _null_;
set two(where=(j<100));
run;
* Merge the data sets with proc sql - no index is used;
proc sql;
create table onetwo as
select
one.*,
two.x
from one, two
where
one.j=two.j and
one.k=two.k;
quit;
Consider the following (admittedly long) example.
The sample code creates two data sets, data one with "key" variables i,j,k and data two with key variables j,k and a "value" variable x. I'd like to merge these two data sets as efficiently as possible. Both of the data sets are indexed with respect to j and k: the index for the first data should not be needed but it's there anyway.
Proc SQL does not use the index in the data two, which I suppose would be the case if the data were in a relational database. Is this just a limitation of the query optimizer I have to accept?
EDIT: The answer to this question is yes, SAS can use an index to optimize a PROC SQL join. In the following example, the relative sizes of the data sets matters: If you modify the code so that data two becomes relatively larger than data one, index will be used. Whether the data sets are sorted or not, does not matter.
* Just to control the size of the data;
%let j_max=10000;
* Create data sets;
data one;
do i=1 to 3;
do j=1 to &j_max;
do k=1 to 4;
if ranuni(0)<0.9 then output;
end;
end;
end;
run;
data two;
do j=1 to &j_max;
do k=1 to 4;
x=ranuni(0);
if ranuni(0)<0.9 then output;
end;
end;
run;
* Create indices;
proc datasets library=work nolist;
modify one;
index create idx_j_k=(j k);
modify two;
index create idx_j_k=(j k) / unique;
run;quit;
* Test the use of an index for the other data set:
* Log should display "INFO: Index idx_j_k selected for WHERE clause optimization.";
options msglevel=i;
data _null_;
set two(where=(j<100));
run;
* Merge the data sets with proc sql - no index is used;
proc sql;
create table onetwo as
select
one.*,
two.x
from one, two
where
one.j=two.j and
one.k=two.k;
quit;
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您可能正在比较苹果和橙子。对于使用 proc sql 进行的联接,索引可能没有帮助,因为观察值已经按 j 和 k 排序,并且有比使用索引更快的“合并”方法。
另一方面,对于使用
data _null_
步骤进行的子集化,j
上的索引肯定会有所帮助。如果您对 proc sql 进行相同的子集设置,您将看到它正在使用索引。顺便说一句,您可以使用未记录的
_method
选项来检查proc sql
如何执行查询。在我的 Windows 上的 sas 9.2 上,它报告它正在执行所谓的“散列连接”:请参阅 Paul Kent 的 技术说明了解更多信息。
You may be comparing apples and oranges. For the join you do with
proc sql
, the index may not help because the observations are already ordered by j and k and there are faster ways to do "merging" than using indices.For the subsetting you do with the
data _null_
step, on the other hand, an index onj
would surely help. If you do the same subsetting with theproc sql
, you will see that it is using the index.By the way, you can use the undocumented
_method
option to examine howproc sql
executes your query. On my sas 9.2 on windows, it reports that it is doing what is called a "hash join":See Paul Kent's Tech note for more information .