R 中的频率加权，与 Stata 的结果比较

发布于 2024-10-26 20:28:03 字数 5149 浏览 0 评论 0原文

我正在尝试分析明尼苏达大学 IPUMS 数据集的1990 年美国人口普查在 R 中。我正在使用 调查 包，因为数据是加权。仅获取家庭数据（并忽略个人变量以保持简单），我尝试计算 hh Revenue 的平均值 (家庭收入）。为此，我使用调查设计对象 “svydesign 函数的文档”>svydesign() 函数包含以下代码：

> require(foreign)
> ipums.household <- read.dta("/path/to/stata_export.dta")
> ipums.household[ipums.household$hhincome==9999999, "hhincome"] <- NA # Fix missing 
> ipums.hh.design <- svydesign(id=~1, weights=~hhwt, data=ipums.household)
> svymean(ipums.household$hhincome, ipums.hh.design, na.rm=TRUE)
      mean     SE
[1,] 37029 17.365

到目前为止一切顺利。但是，如果我尝试在 Stata 中进行相同的计算（使用用于同一数据集不同部分的代码）：

use "C:\I\Hate\Backslashes\stata_export.dta"
replace hhincome = . if hhincome == 9999999
(933734 real changes made, 933734 to missing)

mean hhincome [fweight = hhwt] # The code from the link above.

Mean estimation                     Number of obs    = 91746420

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    hhincome |   37028.99   3.542749      37022.05    37035.94
--------------------------------------------------------------

并且，调查的作者在寻找另一种给这只猫剥皮的方法，有频率加权的建议：

expanded.data<-as.data.frame(lapply(compressed.data,
               function(x) rep(x,compressed.data$weights)))

但是，我似乎无法让这段代码工作：

> hh.dataframe <- data.frame(ipums.household$hhincome, ipums.household$hhwt)
> expanded.hh.dataframe <- as.data.frame(lapply(hh.dataframe, function(x) rep(x, hh.dataframe$hhwt)))
Error in rep(x, hh.dataframe$hhwt) : invalid 'times' argument

我似乎无法修复。这可能与此问题。

总而言之：

为什么我在 Stata 和 R 中没有得到相同的答案？
哪一个是正确的（或者我在这两种情况下都做错了什么）？
假设我让 rep() 解决方案正常工作，这会复制 Stata 的结果吗？
正确的做法是什么？如果答案允许我使用 plyr 用于执行任意计算的包，而不是仅限于 survey 中实现的函数（svymean()、svyglm() 等）

更新

因此，在我通过电子邮件从 IPUMS 获得了出色的帮助后，我使用以下代码来正确处理调查权重。我在这里描述以防其他人将来遇到这个问题。

初始 Stata 准备

由于 IPUMS 目前未发布用于将数据导入 R 的脚本，因此您需要从 Stata 开始、SAS 或 SPSS。我现在会继续使用 Stata。首先从 IPUMS 运行导入脚本。然后，在继续添加以下变量之前：

generate strata = statefip*100000 + puma

这将为每个 PUMA 创建一个 240001 形式的唯一整数，前两位数字作为州 fip 代码（马里兰州为 24），后四位数字为PUMA id 在每个州都是唯一的。如果您要使用 R 您可能还会发现运行它也很有帮助。

generate statefip_num = statefip * 1

这将创建一个没有标签的附加变量，因为导入 .dta< /code> 文件到 R 应用标签并丢失底层整数。

Stata 和 `svyset`

正如 Keith 所解释的，调查抽样是由 Stata 通过调用 svyset 来处理的。

对于个人层面的分析，我现在使用：

svyset serial [pweight=perwt], strata(strata)

这将权重设置为 perwt，将分层设置为我们上面创建的变量，并使用家庭serial 号码来说明聚类。如果我们使用多年，我们可能还想尝试

generate double yearserial = year*100000000 + serial

考虑纵向聚类。

对于家庭层面的分析（没有年份）：

svyset serial [pweight=hhwt], strata(strata)

应该是不言自明的（尽管我认为在这种情况下序列实际上是多余的）。将 serial 替换为 yearserial 将考虑时间序列。

在 `R` 中执行此操作

假设您正在导入一个 .dta 文件，其中包含上面解释的附加 strata 变量，并在单个字母处进行分析：

require(foreign)
ipums <- read.dta('/path/to/data.dta')
require(survey)
ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=perwt)

或者家庭层面：

ipums.hh.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=hhwt)

希望有人觉得这有帮助，非常感谢 IPUMS 的 Dwin、Keith 和 Brandon。

原文

I'm trying to analyze data from the University of Minnesota IPUMS dataset for the 1990 US census in R. I'm using the survey package because the data is weighted. Just taking the household data (and ignoring the person variables to keep things simple), I am attempting to calculate the mean of hhincome (household income). To do this I created a survey design object using the svydesign() function with the following code:

> require(foreign)
> ipums.household <- read.dta("/path/to/stata_export.dta")
> ipums.household[ipums.household$hhincome==9999999, "hhincome"] <- NA # Fix missing 
> ipums.hh.design <- svydesign(id=~1, weights=~hhwt, data=ipums.household)
> svymean(ipums.household$hhincome, ipums.hh.design, na.rm=TRUE)
      mean     SE
[1,] 37029 17.365

So far so good. However, I get a different standard error if I attempt the same calculation in Stata (using code meant for a different portion of the same dataset):

use "C:\I\Hate\Backslashes\stata_export.dta"
replace hhincome = . if hhincome == 9999999
(933734 real changes made, 933734 to missing)

mean hhincome [fweight = hhwt] # The code from the link above.

Mean estimation                     Number of obs    = 91746420

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    hhincome |   37028.99   3.542749      37022.05    37035.94
--------------------------------------------------------------

And, looking at another way to skin this cat, the author of survey, has this suggestion for frequency weighting:

expanded.data<-as.data.frame(lapply(compressed.data,
               function(x) rep(x,compressed.data$weights)))

However, I can't seem to get this code to work:

> hh.dataframe <- data.frame(ipums.household$hhincome, ipums.household$hhwt)
> expanded.hh.dataframe <- as.data.frame(lapply(hh.dataframe, function(x) rep(x, hh.dataframe$hhwt)))
Error in rep(x, hh.dataframe$hhwt) : invalid 'times' argument

Which I can't seem to fix. This may be related to this issue.

So in sum:

Why don't I get the same answers in Stata and R?
Which one is right (or am I doing something wrong in both cases)?
Assuming I got the rep() solution working, would that replicate Stata's results?
What's the right way to do it? Kudos if the answer allows me to use the plyr package for doing arbitrary calculations, rather than being limited to the functions implemented in survey (svymean(), svyglm() etc.)

Update

So after the excellent help I've received here and from IPUMS via email, I'm using the following code to properly handle survey weighting. I describe here in case someone else has this problem in future.

Initial Stata Preparation

Since IPUMS don't currently publish scripts for importing their data into R, you'll need to start from Stata, SAS, or SPSS. I'll stick with Stata for now. Begin by running the import script from IPUMS. Then before continuing add the following variable:

generate strata = statefip*100000 + puma

This creates a unique integer for each PUMA of the form 240001, with first two digits as the state fip code (24 in the case of Maryland) and the last four a PUMA id which is unique on a per state basis. If you're going to use R you might also find it helpful to run this as well

generate statefip_num = statefip * 1

This will create an additional variable without labels, since importing .dta files into R apply the labels and lose the underlying integers.

Stata and `svyset`

As Keith explained, survey sampling is handled by Stata by invoking svyset.

For an individual level analysis I now use:

svyset serial [pweight=perwt], strata(strata)

This sets the weighting to perwt, the stratification to the variable we created above, and uses the household serial number to account for clustering. If we were using multiple years, we might want to try

generate double yearserial = year*100000000 + serial

to account for longitudinal clustering as well.

For household level analysis (without years):

svyset serial [pweight=hhwt], strata(strata)

Should be self-explanatory (though I think in this case serial is actually superfluous). Replacing serial with yearserial will take into account a time series.

Doing it in `R`

Assuming you're importing a .dta file with the additional strata variable explained above and analysing at the individual letter:

require(foreign)
ipums <- read.dta('/path/to/data.dta')
require(survey)
ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=perwt)

Or at the household level:

ipums.hh.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=hhwt)

Hope someone finds this helpful, and thanks so much to Dwin, Keith and Brandon from IPUMS.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

锦上情书 2024-11-02 20:28:03

1&2) 您引用的 Lumley 评论是 2001 年写的，早于他发表的调查包的任何作品，而调查包才发布了几年。您可能在两种不同的意义上使用“权重”。（Lumley 在他的书的早期描述了三种可能的含义。）调查函数 svydesign 使用概率权重而不是频率权重。考虑到该数据集的巨大规模，这些似乎并不是真正的频率权重，而是概率权重，这意味着调查包结果是正确的，而 Stata 结果是错误的。如果您不相信，那么 Survey 包提供了函数 as.svrepdesign()，Lumley 的书中描述了如何从 svydesign 对象创建复制权重向量。

3）我认为是这样，但正如 RMN 所说......“这是错误的。”

4）因为它是错误的（IMO），所以没有必要。

回复收藏 0 原文

九厘米的零° 2024-11-02 20:28:03

您不应该在 Stata 中使用频率权重。这很清楚。如果 IPUMS 没有“复杂”的调查设计，您可以使用：

mean hhincome [pw = hhwt]

或者，为了方便起见：

svyset [pw = hhwt]
svy: mean hhincome
svy: regress hhincome `x'

第二个选项的好处是您可以将它用于更复杂的调查设计（通过 svyset。然后你就可以运行很多命令，而不必一直输入 [pw...]。

You shouldn't be using frequency weights in Stata. That is pretty clear. If IPUMS doesn't have a "complex" survey design, you can just use:

mean hhincome [pw = hhwt]

Or, for convenience:

svyset [pw = hhwt]
svy: mean hhincome
svy: regress hhincome `x'

What's nice about the second option is that you can use it for more complex survey designs (via options on svyset. Then you can run lots of commands without having to typ [pw...] all the time.

回复收藏 0 原文

意中人 2024-11-02 20:28:03

对于无法访问 Stata 或 SAS 的人来说，略有增加；（我会把这个放在评论中，但是......）
SAScii库可以使用SAS代码文件读入IPUMS下载的数据。读取数据的代码来自 doc

library(SAScii)
IPUMS.file.location <- "..\\usa_00007dat\\usa_00007.dat"
IPUMS.SAS.read.in.instructions <- "..\\usa_00007dat\\usa_00007.sas"

#store the IPUMS extract as an R data frame!
IPUMS.df <- 
  read.SAScii ( 
    IPUMS.file.location , 
    IPUMS.SAS.read.in.instructions , 
    zipped = F )

Slight addition for people who don't have access to Stata or SAS; (I would put this in comments but...)
The library SAScii can use the SAS code file to read in the IPUMS downloaded data. The code to read in the data is from the doc

library(SAScii)
IPUMS.file.location <- "..\\usa_00007dat\\usa_00007.dat"
IPUMS.SAS.read.in.instructions <- "..\\usa_00007dat\\usa_00007.sas"

#store the IPUMS extract as an R data frame!
IPUMS.df <- 
  read.SAScii ( 
    IPUMS.file.location , 
    IPUMS.SAS.read.in.instructions , 
    zipped = F )

回复收藏 0 原文

~没有更多了~

关于作者

壹場煙雨

暂无简介

0 文章

0 评论

22 人气

关注发私信

友情链接

文江博客

R 中的频率加权，与 Stata 的结果比较

更新

初始 Stata 准备

Stata 和 `svyset`

在 `R` 中执行此操作

Update

Initial Stata Preparation

Stata and `svyset`

Doing it in `R`

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（3）

关于作者

相关话题

热门标签

推荐作者

留蓝

18790681156

zach7772

Wini

ayeshaaroy

初雪

友情链接

R 中的频率加权，与 Stata 的结果比较

更新

初始 Stata 准备

Stata 和 svyset

在 R 中执行此操作

Update

Initial Stata Preparation

Stata and svyset

Doing it in R

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（3）

关于作者

相关话题

热门标签

推荐作者

留蓝

18790681156

zach7772

Wini

ayeshaaroy

初雪

友情链接

Stata 和 `svyset`

在 `R` 中执行此操作

Stata and `svyset`

Doing it in `R`

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。