SAS Proc SQL 合并时是否使用索引

发布于 2024-08-08 07:39:22 字数 1361 浏览 7 评论 0原文

考虑以下（诚然很长）的示例。

示例代码创建两个数据集，数据一具有“关键”变量 i、j、k，数据二具有关键变量 j、k 和“值”变量 x。我想尽可能有效地合并这两个数据集。两个数据集都根据 j 和 k 进行索引：不需要第一个数据的索引，但无论如何它都在那里。

Proc SQL 不使用数据二中的索引，我想如果数据位于关系数据库中就会出现这种情况。这只是我必须接受的查询优化器的限制吗？

编辑：这个问题的答案是肯定的，SAS可以使用索引来优化PROC SQL连接。在以下示例中，数据集的相对大小很重要：如果修改代码以使数据二变得相对大于数据一，则将使用索引。数据集是否排序并不重要。

* Just to control the size of the data;
%let j_max=10000;

* Create data sets;
data one;
    do i=1 to 3;
        do j=1 to &j_max;
            do k=1 to 4;
                if ranuni(0)<0.9 then output;
            end;
        end;
    end;
run;

data two;
    do j=1 to &j_max;
        do k=1 to 4;
            x=ranuni(0);
            if ranuni(0)<0.9 then output;
        end;
    end;
run;

* Create indices;
proc datasets library=work nolist;
    modify one;
    index create idx_j_k=(j k);
    modify two;
    index create idx_j_k=(j k) / unique;
run;quit;

* Test the use of an index for the other data set:
* Log should display "INFO: Index idx_j_k selected for WHERE clause optimization.";
options msglevel=i;
data _null_;
    set two(where=(j<100));
run;

* Merge the data sets with proc sql - no index is used;
proc sql;
    create table onetwo as
    select
        one.*,
        two.x
    from one, two
    where
        one.j=two.j and
        one.k=two.k;
quit;

原文

Consider the following (admittedly long) example.

The sample code creates two data sets, data one with "key" variables i,j,k and data two with key variables j,k and a "value" variable x. I'd like to merge these two data sets as efficiently as possible. Both of the data sets are indexed with respect to j and k: the index for the first data should not be needed but it's there anyway.

Proc SQL does not use the index in the data two, which I suppose would be the case if the data were in a relational database. Is this just a limitation of the query optimizer I have to accept?

EDIT: The answer to this question is yes, SAS can use an index to optimize a PROC SQL join. In the following example, the relative sizes of the data sets matters: If you modify the code so that data two becomes relatively larger than data one, index will be used. Whether the data sets are sorted or not, does not matter.

* Just to control the size of the data;
%let j_max=10000;

* Create data sets;
data one;
    do i=1 to 3;
        do j=1 to &j_max;
            do k=1 to 4;
                if ranuni(0)<0.9 then output;
            end;
        end;
    end;
run;

data two;
    do j=1 to &j_max;
        do k=1 to 4;
            x=ranuni(0);
            if ranuni(0)<0.9 then output;
        end;
    end;
run;

* Create indices;
proc datasets library=work nolist;
    modify one;
    index create idx_j_k=(j k);
    modify two;
    index create idx_j_k=(j k) / unique;
run;quit;

* Test the use of an index for the other data set:
* Log should display "INFO: Index idx_j_k selected for WHERE clause optimization.";
options msglevel=i;
data _null_;
    set two(where=(j<100));
run;

* Merge the data sets with proc sql - no index is used;
proc sql;
    create table onetwo as
    select
        one.*,
        two.x
    from one, two
    where
        one.j=two.j and
        one.k=two.k;
quit;

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

似梦非梦 2024-08-15 07:39:22

您可能正在比较苹果和橙子。对于使用 proc sql 进行的联接，索引可能没有帮助，因为观察值已经按 j 和 k 排序，并且有比使用索引更快的“合并”方法。

另一方面，对于使用 data _null_ 步骤进行的子集化，j 上的索引肯定会有所帮助。如果您对 proc sql 进行相同的子集设置，您将看到它正在使用索引。

proc sql;
  select * from two where j < 100;
quit;
/* on log
INFO: Index idx_j_k selected for WHERE clause optimization.
*/

顺便说一句，您可以使用未记录的 _method 选项来检查 proc sql 如何执行查询。在我的 Windows 上的 sas 9.2 上，它报告它正在执行所谓的“散列连接”：

proc sql _method;
  create table onetwo as
  select
    one.*,
    two.x
  from one, two
  where
    one.j=two.j and
    one.k=two.k;
quit;

/* on log
NOTE: SQL execution methods chosen are:

  sqxcrta
      sqxjhsh
          sqxsrc( WORK.ONE )
          sqxsrc( WORK.TWO )
*/

请参阅 Paul Kent 的技术说明了解更多信息。

You may be comparing apples and oranges. For the join you do with proc sql, the index may not help because the observations are already ordered by j and k and there are faster ways to do "merging" than using indices.

For the subsetting you do with the data _null_ step, on the other hand, an index on j would surely help. If you do the same subsetting with the proc sql, you will see that it is using the index.

proc sql;
  select * from two where j < 100;
quit;
/* on log
INFO: Index idx_j_k selected for WHERE clause optimization.
*/

By the way, you can use the undocumented _method option to examine how proc sql executes your query. On my sas 9.2 on windows, it reports that it is doing what is called a "hash join":

proc sql _method;
  create table onetwo as
  select
    one.*,
    two.x
  from one, two
  where
    one.j=two.j and
    one.k=two.k;
quit;

/* on log
NOTE: SQL execution methods chosen are:

  sqxcrta
      sqxjhsh
          sqxsrc( WORK.ONE )
          sqxsrc( WORK.TWO )
*/

See Paul Kent's Tech note for more information .

回复收藏 0 原文

~没有更多了~