
What has a bigger impact on SAS data set performance: the number of observations or the number of variables?

Posted 2024-08-07 12:58:37

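After working with different data sets in SAS for a month or two, it seems to me that the more variables a data set has, the longer it takes to run PROCs and other operations on it. Yet if I have, say, 5 variables but 1 million observations, performance is not affected much.

While I'm interested in whether observations or variables affect performance more, I'd also like to know whether there are other factors I'm missing when looking at SAS performance.

Thanks!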


陌伤浅笑 2024-08-14 12:58:37

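For data sets of the same size (rows × columns), I believe the one with more variables will usually be slower. I tried creating two data sets, one with 1 row and 10,000 columns and one with 1 column and 10,000 rows. The one with more variables took far more memory and time.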

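options fullstimer;  /* report detailed timing and memory statistics in the log */

/* Wide data set: 1 observation, 10,000 variables */
data a;
    retain var1-var10000 1;
run;

/* Long data set: 10,000 observations, 1 variable */
data b(drop=i);
    do i=1 to 10000;
        var1=i;
        output;
    end;
run;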

In the log:

31   options fullstimer;
32   data a;
33       retain var1-var10000 1;
34   run;

NOTE: The data set WORK.A has 1 observations and 10000 variables.
NOTE: DATA statement used (Total process time):
      real time           0.23 seconds
      user cpu time       0.20 seconds
      system cpu time     0.03 seconds
      Memory                            5382k
      OS Memory                         14208k
      Timestamp            10/14/2009  2:03:57 PM


35   data b(drop=i);
36       do i=1 to 10000;
37       var1=i;
38       output;
39       end;
40   run;

NOTE: The data set WORK.B has 10000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      user cpu time       0.00 seconds
      system cpu time     0.01 seconds
      Memory                            173k
      OS Memory                         12144k
      Timestamp            10/14/2009  2:03:57 PM

You should also check out BUFNO= and BUFSIZE=. If you have to access a data set many times, you might also consider using SASFILE to hold the entire data set in memory.
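As a rough illustration of those options, the sketch below sets BUFSIZE= when a data set is created, passes BUFNO= on a read, and uses SASFILE to keep the data set in memory across several steps. The data set name `work.b_long` and the buffer values are assumptions chosen for the example, not tuned recommendations. Suitable values depend on the host and the data.

```sas
options fullstimer;

/* BUFSIZE= is a property of the data set, fixed when it is created */
data work.b_long(bufsize=64k);
    do i = 1 to 10000;
        var1 = i;
        output;
    end;
    drop i;
run;

/* BUFNO= sets how many buffers this step uses to read the data set */
data _null_;
    set work.b_long(bufno=10);
run;

/* Load the whole data set into memory once, then read it repeatedly */
sasfile work.b_long load;

proc means data=work.b_long;
run;

proc means data=work.b_long min max;
run;

/* Release the memory when finished */
sasfile work.b_long close;
```

Comparing the FULLSTIMER notes for each step with and without these options is one way to see whether they help on a given system.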


枉心 2024-08-14 12:58:37

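I can't quite explain it (and am making an educated guess), but I imagine it comes down to a combination of factors, including that a whole record is read into the PDV (program data vector), so more data sits in memory when there are many variables.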
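One way to probe that guess is to read the same wide data set twice, once with every variable and once with a KEEP= data set option on the SET statement so that only a few variables are brought into the PDV. This is only a sketch. It rebuilds the 10,000-variable data set from the first answer under the assumed name `work.a_wide`, and the choice of `var1-var10` is arbitrary.

```sas
options fullstimer;

/* Wide data set: 1 observation, 10,000 variables */
data work.a_wide;
    retain var1-var10000 1;
run;

/* All 10,000 variables are placed in the PDV */
data _null_;
    set work.a_wide;
run;

/* Only var1-var10 are placed in the PDV */
data _null_;
    set work.a_wide(keep=var1-var10);
run;
```

The memory and CPU figures in the FULLSTIMER notes for the two DATA _NULL_ steps can then be compared directly.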

It might be worth taking some measurements with compressed data sets, because I/O is often the bottleneck.

SAS data set option:

data foo(compress=yes);
...
run;
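A measurement along those lines might look like the following sketch. The data set names, the 50 character variables and the row count are all made up for illustration. Long, mostly blank character values are used because that kind of data tends to compress well; other data may not benefit at all.

```sas
options fullstimer;

/* Write the same rows to an uncompressed and a compressed data set */
data work.wide_plain work.wide_comp(compress=yes);
    length c1-c50 $200;
    array c{50} c1-c50;
    do i = 1 to 10000;
        do j = 1 to 50;
            c{j} = 'x';
        end;
        output;
    end;
    drop i j;
run;

/* Read each one back and compare the timing notes in the log */
data _null_;
    set work.wide_plain;
run;

data _null_;
    set work.wide_comp;
run;

/* PROC CONTENTS reports whether a data set is compressed and how many pages it uses */
proc contents data=work.wide_plain;
run;

proc contents data=work.wide_comp;
run;
```

Compression trades extra CPU for less I/O, so the compressed read is not guaranteed to be faster.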
