R 中的频率加权,与 Stata 的结果比较
我正在尝试分析明尼苏达大学 IPUMS 数据集的1990 年美国人口普查 在 R
中。我正在使用 调查
包,因为数据是 加权。仅获取家庭数据(并忽略个人变量以保持简单),我尝试计算 hh Revenue
的平均值 (家庭收入)。为此,我使用 调查设计对象 “svydesign 函数的文档”>svydesign()
函数包含以下代码:
> require(foreign)
> ipums.household <- read.dta("/path/to/stata_export.dta")
> ipums.household[ipums.household$hhincome==9999999, "hhincome"] <- NA # Fix missing
> ipums.hh.design <- svydesign(id=~1, weights=~hhwt, data=ipums.household)
> svymean(ipums.household$hhincome, ipums.hh.design, na.rm=TRUE)
mean SE
[1,] 37029 17.365
到目前为止一切顺利。但是,如果我尝试在 Stata
中进行相同的计算(使用 用于同一数据集不同部分的代码 ):
use "C:\I\Hate\Backslashes\stata_export.dta"
replace hhincome = . if hhincome == 9999999
(933734 real changes made, 933734 to missing)
mean hhincome [fweight = hhwt] # The code from the link above.
Mean estimation Number of obs = 91746420
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
hhincome | 37028.99 3.542749 37022.05 37035.94
--------------------------------------------------------------
并且,调查
的作者在寻找另一种给这只猫剥皮的方法,有频率加权的建议:
expanded.data<-as.data.frame(lapply(compressed.data,
function(x) rep(x,compressed.data$weights)))
但是,我似乎无法让这段代码工作:
> hh.dataframe <- data.frame(ipums.household$hhincome, ipums.household$hhwt)
> expanded.hh.dataframe <- as.data.frame(lapply(hh.dataframe, function(x) rep(x, hh.dataframe$hhwt)))
Error in rep(x, hh.dataframe$hhwt) : invalid 'times' argument
我似乎无法修复。这可能与 此问题。
总而言之:
- 为什么我在
Stata
和R
中没有得到相同的答案? - 哪一个是正确的(或者我在这两种情况下都做错了什么)?
- 假设我让
rep()
解决方案正常工作,这会复制Stata
的结果吗? - 正确的做法是什么?如果答案允许我使用
plyr
用于执行任意计算的包,而不是仅限于survey
中实现的函数(svymean()
、svyglm()
等)
更新
因此,在我通过电子邮件从 IPUMS 获得了出色的帮助后,我使用以下代码来正确处理调查权重。我在这里描述以防其他人将来遇到这个问题。
初始 Stata 准备
由于 IPUMS 目前未发布用于将数据导入 R
的脚本,因此您需要从 Stata
开始、SAS
或 SPSS
。我现在会继续使用 Stata
。首先从 IPUMS 运行导入脚本。然后,在继续添加以下变量之前:
generate strata = statefip*100000 + puma
这将为每个 PUMA
创建一个 240001 形式的唯一整数,前两位数字作为州 fip 代码(马里兰州为 24),后四位数字为PUMA
id 在每个州都是唯一的。如果您要使用 R
您可能还会发现运行它也很有帮助。
generate statefip_num = statefip * 1
这将创建一个没有标签的附加变量,因为导入 .dta< /code> 文件到
R
应用标签并丢失底层整数。
Stata 和 svyset
正如 Keith 所解释的,调查抽样是由 Stata
通过调用 svyset
来处理的。
对于个人层面的分析,我现在使用:
svyset serial [pweight=perwt], strata(strata)
这将权重设置为 perwt
,将分层设置为我们上面创建的变量,并使用家庭serial
号码来说明聚类。如果我们使用多年,我们可能还想尝试
generate double yearserial = year*100000000 + serial
考虑纵向聚类。
对于家庭层面的分析(没有年份):
svyset serial [pweight=hhwt], strata(strata)
应该是不言自明的(尽管我认为在这种情况下序列实际上是多余的)。将 serial
替换为 yearserial
将考虑时间序列。
在 R
中执行此操作
假设您正在导入一个 .dta
文件,其中包含上面解释的附加 strata
变量,并在单个字母处进行分析:
require(foreign)
ipums <- read.dta('/path/to/data.dta')
require(survey)
ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=perwt)
或者家庭层面:
ipums.hh.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=hhwt)
希望有人觉得这有帮助,非常感谢 IPUMS 的 Dwin、Keith 和 Brandon。
I'm trying to analyze data from the University of Minnesota IPUMS dataset for the 1990 US census in R
. I'm using the survey
package because the data is weighted. Just taking the household data (and ignoring the person variables to keep things simple), I am attempting to calculate the mean of hhincome
(household income). To do this I created a survey design object using the svydesign()
function with the following code:
> require(foreign)
> ipums.household <- read.dta("/path/to/stata_export.dta")
> ipums.household[ipums.household$hhincome==9999999, "hhincome"] <- NA # Fix missing
> ipums.hh.design <- svydesign(id=~1, weights=~hhwt, data=ipums.household)
> svymean(ipums.household$hhincome, ipums.hh.design, na.rm=TRUE)
mean SE
[1,] 37029 17.365
So far so good. However, I get a different standard error if I attempt the same calculation in Stata
(using code meant for a different portion of the same dataset):
use "C:\I\Hate\Backslashes\stata_export.dta"
replace hhincome = . if hhincome == 9999999
(933734 real changes made, 933734 to missing)
mean hhincome [fweight = hhwt] # The code from the link above.
Mean estimation Number of obs = 91746420
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
hhincome | 37028.99 3.542749 37022.05 37035.94
--------------------------------------------------------------
And, looking at another way to skin this cat, the author of survey
, has this suggestion for frequency weighting:
expanded.data<-as.data.frame(lapply(compressed.data,
function(x) rep(x,compressed.data$weights)))
However, I can't seem to get this code to work:
> hh.dataframe <- data.frame(ipums.household$hhincome, ipums.household$hhwt)
> expanded.hh.dataframe <- as.data.frame(lapply(hh.dataframe, function(x) rep(x, hh.dataframe$hhwt)))
Error in rep(x, hh.dataframe$hhwt) : invalid 'times' argument
Which I can't seem to fix. This may be related to this issue.
So in sum:
- Why don't I get the same answers in
Stata
andR
? - Which one is right (or am I doing something wrong in both cases)?
- Assuming I got the
rep()
solution working, would that replicateStata
's results? - What's the right way to do it? Kudos if the answer allows me to use the
plyr
package for doing arbitrary calculations, rather than being limited to the functions implemented insurvey
(svymean()
,svyglm()
etc.)
Update
So after the excellent help I've received here and from IPUMS via email, I'm using the following code to properly handle survey weighting. I describe here in case someone else has this problem in future.
Initial Stata Preparation
Since IPUMS don't currently publish scripts for importing their data into R
, you'll need to start from Stata
, SAS
, or SPSS
. I'll stick with Stata
for now. Begin by running the import script from IPUMS. Then before continuing add the following variable:
generate strata = statefip*100000 + puma
This creates a unique integer for each PUMA
of the form 240001, with first two digits as the state fip code (24 in the case of Maryland) and the last four a PUMA
id which is unique on a per state basis. If you're going to use R
you might also find it helpful to run this as well
generate statefip_num = statefip * 1
This will create an additional variable without labels, since importing .dta
files into R
apply the labels and lose the underlying integers.
Stata and svyset
As Keith explained, survey sampling is handled by Stata
by invoking svyset
.
For an individual level analysis I now use:
svyset serial [pweight=perwt], strata(strata)
This sets the weighting to perwt
, the stratification to the variable we created above, and uses the household serial
number to account for clustering. If we were using multiple years, we might want to try
generate double yearserial = year*100000000 + serial
to account for longitudinal clustering as well.
For household level analysis (without years):
svyset serial [pweight=hhwt], strata(strata)
Should be self-explanatory (though I think in this case serial is actually superfluous). Replacing serial
with yearserial
will take into account a time series.
Doing it in R
Assuming you're importing a .dta
file with the additional strata
variable explained above and analysing at the individual letter:
require(foreign)
ipums <- read.dta('/path/to/data.dta')
require(survey)
ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=perwt)
Or at the household level:
ipums.hh.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=hhwt)
Hope someone finds this helpful, and thanks so much to Dwin, Keith and Brandon from IPUMS.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
1&2) 您引用的 Lumley 评论是 2001 年写的,早于他发表的调查包的任何作品,而调查包才发布了几年。您可能在两种不同的意义上使用“权重”。 (Lumley 在他的书的早期描述了三种可能的含义。)调查函数 svydesign 使用概率权重而不是频率权重。考虑到该数据集的巨大规模,这些似乎并不是真正的频率权重,而是概率权重,这意味着调查包结果是正确的,而 Stata 结果是错误的。如果您不相信,那么 Survey 包提供了函数 as.svrepdesign(),Lumley 的书中描述了如何从 svydesign 对象创建复制权重向量。
3)我认为是这样,但正如 RMN 所说......“这是错误的。”
4)因为它是错误的(IMO),所以没有必要。
1&2) The comment you cited from Lumley was written in 2001 and predates any of his published work with the survey package which has only been out a few years. You are probably using "weights" in two different senses. (Lumley describes three possible senses early in his book.) The survey function svydesign is using probability weights rather than frequency weights. Seems likely that these are not really frequency weights but rather probability weights, given the massive size of that dataset, and that would mean that the survey package result is correct and the Stata result incorrect. If you are not convinced, then the survey package offers the function as.svrepdesign() with which Lumley's book describes how to create a replicate weight vector from a svydesign-object.
3) I think so, but as RMN said ..."It would be wrong."
4) Since it's wrong (IMO) it's not necessary.
您不应该在 Stata 中使用频率权重。这很清楚。如果 IPUMS 没有“复杂”的调查设计,您可以使用:
或者,为了方便起见:
第二个选项的好处是您可以将它用于更复杂的调查设计(通过 svyset。然后你就可以运行很多命令,而不必一直输入 [pw...]。
You shouldn't be using frequency weights in Stata. That is pretty clear. If IPUMS doesn't have a "complex" survey design, you can just use:
Or, for convenience:
What's nice about the second option is that you can use it for more complex survey designs (via options on svyset. Then you can run lots of commands without having to typ [pw...] all the time.
对于无法访问 Stata 或 SAS 的人来说,略有增加; (我会把这个放在评论中,但是......)
SAScii库可以使用SAS代码文件读入IPUMS下载的数据。读取数据的代码来自 doc
Slight addition for people who don't have access to Stata or SAS; (I would put this in comments but...)
The library SAScii can use the SAS code file to read in the IPUMS downloaded data. The code to read in the data is from the doc