How to aggregate IQR in SPSS?

Posted on 2024-10-31 09:07:20

I have to aggregate a fairly large data table containing some continuous variables (with a categorical break variable, of course), producing the mean, median, standard deviation and interquartile range (IQR) of the required variables.

The first three are easy with the SPSS Aggregate command, but I have no idea how to compute the IQR while aggregating the data table.

I know I could compute the IQR using Descriptives (via quartiles), but since I need the calculations in an aggregation, this is not an option. Unfortunately, using R also fails due to some odd circumstances: I am not able to load the huge comma-separated file in R with base::read.table, sqldf, bigmemory, or ff.

Any idea is welcome! And of course: thank you in advance.


P.S.: I thought about estimating the IQR by multiplying the standard deviation by 1.5, but that method would not work because the distributions are skewed, so the normality assumption does not hold.

P.S.: Do you think using R within SPSS would avoid the memory problems I get when opening the dataset in pure R?


Comments (2)

森林散布 2024-11-07 09:07:20

This syntax should do the trick. There is no need to migrate back and forth between SPSS and R solely for this task.

*making fake data, 4 million records and 150 variables.
input program.
loop i = 1 to 4000000.
end case.
end loop.
end file.
end input program.
dataset name Temp.
execute.

vector X(150).
do repeat X = X1 to X150.
compute X = RV.NORMAL(0,1).
end repeat.

*This is the command you are interested in, puts the stats table into a new dataset.
Dataset declare IQR.
OMS
/SELECT TABLES
/IF SUBTYPES = 'Statistics'
/DESTINATION FORMAT = SAV outfile = 'IQR' VIEWER=NO.
freq var = X1
/format = notable
/ntiles = 4.
OMSEND.

This still takes a long time with such a large dataset, but that's to be expected. Just search the SPSS help files for "OMS" to find example syntax showing how OMS works.


Given the further constraint that you want to calculate the IQR for many groups, there are a few different ways I could see to proceed. One would be to just use the split file command and run the above frequency command again.

*split file expects the cases to be sorted by the grouping variable.
sort cases by group.
split file by group.
freq var = X1 X2
/format = notable
/ntiles = 4.
split file off.
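
To get those per-group quartiles into a dataset rather than the Viewer, the split-file run can be wrapped in the same OMS block used earlier. This is only a sketch under assumptions: `group` is the categorical break variable, `X1` and `X2` come from the fake data above, and the dataset name `IQRbyGroup` is made up for the example.

```spss
* Sketch only: group is the assumed break variable, IQRbyGroup a made-up dataset name.
sort cases by group.
split file layered by group.

* Route the Statistics table (quartiles per group) into a new dataset.
dataset declare IQRbyGroup.
oms
  /select tables
  /if subtypes = ['Statistics']
  /destination format = sav outfile = 'IQRbyGroup' viewer = no.
frequencies variables = X1 X2
  /format = notable
  /ntiles = 4.
omsend.

split file off.
```

The captured dataset should hold the 25th, 50th and 75th percentiles for each group, so the IQR is the difference between the 75th and 25th percentile values. The exact layout of the captured table depends on the SPSS version, so inspect it before computing anything from it.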

You could also get specific percentiles within ctables (and do whatever grouping or nesting you want there). A potentially more useful solution at this point, though, is to write a program that saves separate files (or reduces the full dataset to the specific group while it is still loaded), does the calculation on each separate file, and dumps the results into a dataset. Working with a dataset of 4 million records is a pain, and it does not appear to be necessary if you are just splitting the file up anyway. This could be accomplished via macro commands.
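
As a rough illustration of the ctables route, the sketch below asks for the summary statistics per group in one table. It assumes the Custom Tables module is licensed, that `group` is the categorical break variable, and that `X1` is a scale variable as in the fake data above; none of this has been run against the asker's data.

```spss
* Sketch only: group and X1 are assumed variable names.
* CTABLES relies on the declared measurement levels, so set them first.
variable level group (nominal) /X1 (scale).

* One row per group, with the mean, median, SD and the two quartiles of X1.
ctables
  /table group > X1 [mean, median, stddev, ptile 25, ptile 75].
```

The IQR is then the `ptile 75` value minus the `ptile 25` value for each group. Wrapping the command in an OMS block like the one above would capture the table as a dataset.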

放肆 2024-11-07 09:07:20

OMS can capture any pivot table as a dataset, so any statistical results displayed that way can be used as a dataset. Another approach in this case, however, would be to use the RANK command. RANK allows for grouping variables, so you could get the rank within group, and it can compute the quartiles and percentiles within group. For example,

RANK VARIABLES=salary (A) BY jobcat minority
/RANK /NTILES(4) /PERCENT.

Then aggregating with FIRST and the group variables as breaks would give you a dataset of the quartiles by group from which to compute the IQR.
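
One way the RANK-then-AGGREGATE idea might be spelled out is sketched below. It is an approximation, not a definitive implementation: it uses MIN instead of FIRST so that no sort is needed, it treats the lowest value in quartile groups 2 and 4 as the 25th and 75th percentiles (no interpolation), and the names `quart`, `lower`, `q1`, `q3`, `p25`, `p75` and `iqr` are made up for the example.

```spss
* Sketch only: salary, jobcat and minority are the example variables from above.
* AGGREGATE OUTFILE=* replaces the active dataset, so run this on a copy of the data.

* Assign each case to a quartile group (1 to 4) within jobcat by minority.
rank variables = salary (A) by jobcat minority
  /ntiles(4) into quart
  /print = no.

* Smallest salary in each quartile group: one row per jobcat, minority and quart.
aggregate outfile = *
  /break = jobcat minority quart
  /lower = min(salary).

* The lower bounds of groups 2 and 4 approximate the 25th and 75th percentiles.
if (quart = 2) q1 = lower.
if (quart = 4) q3 = lower.

* Collapse to one row per group and take the difference.
aggregate outfile = *
  /break = jobcat minority
  /p25 = max(q1)
  /p75 = max(q3).
compute iqr = p75 - p25.
execute.
```

Groups with very few cases may not have all four quartile groups, in which case `iqr` comes out missing for them.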

Many ways to skin a cat.

-Jon Peck
