How to create, structure, maintain and update data codebooks in R?
In the interest of replication, I like to keep a codebook with metadata for each data frame. A data codebook is:
"a written or computerized list that provides a clear and comprehensive description of the variables that will be included in the database." (Marczyk et al., 2010)
I like to document the following attributes of a variable:
- name
- description (label, format, scale, etc)
- source (e.g. World Bank)
- source media (url and date accessed, CD and ISBN, or whatever)
- file name of the source data on disk (helps when merging codebooks)
- notes
For example, this is what I currently use to document the variables in the data frame mydata1, which has 8 variables:
    # One codebook row per column of mydata1
    code.book.mydata1 <- data.frame(
      variable.name = names(mydata1),
      label = c("Label 1",
                "State name",
                "Personal identifier",
                "Income per capita, thousands of US$, constant year 2000 prices",
                "Unique id",
                "Calendar year",
                "blah",
                "bah"),
      # Placeholders, one per column, to be filled in later
      source       = rep("unknown", ncol(mydata1)),
      source_media = rep("unknown", ncol(mydata1)),
      filename     = rep("unknown", ncol(mydata1)),
      notes        = rep("unknown", ncol(mydata1))
    )
I write a separate codebook for each data set I read. When I merge data frames, I also merge the relevant parts of their associated codebooks, to document the final database. I do this essentially by copying and pasting the code above and changing the arguments.
You can add any special attribute to any R object with the attr function, and the attribute then shows up in the structure of the object (for example, in the output of str). You can also read a given attribute back with the same attr function.
If you only add new cases to your data frame, the attribute will not be affected (see str(rbind(x,x))), while altering the structure will erase it (see str(cbind(x,x))).
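As a minimal sketch of the attr usage described above: the data frame x and the attribute name "source" are made up for illustration and are not the original answer's code.

```r
# Hypothetical example data; 'x' stands in for the data frame in the answer
x <- data.frame(id = 1:3, income = c(10.5, 20.1, 15.3))

# Attach a free-text attribute (the attribute name "source" is arbitrary)
attr(x, "source") <- "World Bank"

# The structure summary lists the extra attribute after the columns
str(x)

# Read the attribute back with the same function
attr(x, "source")

# Compare the structure after adding rows and after adding columns,
# as suggested in the answer
str(rbind(x, x))
str(cbind(x, x))
```

Running the two str calls side by side is a quick way to check which operations in your own workflow keep the attribute.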
Update, based on the comments:
If you want to list all non-standard attributes, compare the attribute names of the object with the standard ones (for data frames these are names, row.names and class).
Based on that, you could write a short function that lists all non-standard attributes together with their values. The following steps work, though not in a neat way; you could improve them and wrap them in a function :)
- First, define the unique (= non-standard) attributes.
- Make a matrix that will hold the names and values.
- Loop through the non-standard attributes and save their names and values in the matrix.
- Convert the matrix to a data frame and name the columns.
- Save the result in any format you like, e.g. as a CSV file.
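The steps above can be sketched as one small function. This is only an illustration under assumptions: list_custom_attributes is a hypothetical helper name, the example data are made up, and vapply is used in place of the explicit loop and matrix described in the answer.

```r
# Hypothetical example data with two custom attributes
x <- data.frame(id = 1:3, income = c(10.5, 20.1, 15.3))
attr(x, "source") <- "World Bank"
attr(x, "notes")  <- "hypothetical example"

# Hypothetical helper: returns a data frame of non-standard attributes
list_custom_attributes <- function(obj,
                                   standard = c("names", "row.names", "class")) {
  # Attribute names that are not part of the standard set
  custom <- setdiff(names(attributes(obj)), standard)
  # Collapse each attribute value into a single string
  values <- vapply(
    custom,
    function(a) paste(format(attr(obj, a, exact = TRUE)), collapse = "; "),
    character(1)
  )
  data.frame(attribute = custom, value = unname(values),
             stringsAsFactors = FALSE)
}

meta <- list_custom_attributes(x)
print(meta)

# Save in any format, here as CSV in a temporary directory
write.csv(meta, file.path(tempdir(), "x_attributes.csv"), row.names = FALSE)
```

Collapsing every value to one string keeps the result tabular, at the cost of losing the structure of attributes that are themselves vectors or data frames.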
Regarding your question about variable labels, check the read.spss function from the foreign package, as it does exactly what you need: it saves the variable labels in the attributes of the imported data. The main idea is that an attribute can itself be a data frame or another object, so you do not need a separate attribute for every variable. Instead, create a single one (e.g. named "variable.labels") and save all the information there. You can then query it like this: attr(x, "variable.labels")['foo']
where 'foo' stands for the required variable name. Check the function cited above and the attributes of the imported data frames for more details. I hope this helps you write the required functions in a much neater way than I tried above! :)
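A minimal sketch of the single-attribute idea, using only base R. The data frame and the labels are made up for illustration; no SPSS file is read here.

```r
# Hypothetical example data
x <- data.frame(id = 1:3, year = c(2000L, 2001L, 2002L))

# One named character vector holds the labels of all variables
attr(x, "variable.labels") <- c(id = "Unique id", year = "Calendar year")

# Look up the label of a single variable by name
attr(x, "variable.labels")["year"]
```

Because the labels live in one named vector, adding a variable only means adding one element rather than inventing a new attribute.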
A more advanced version would be to use S4 classes. For example, in Bioconductor the ExpressionSet class is used to store microarray data together with its associated experimental metadata.
The MIAME object described in Section 4.4 looks very similar to what you are after.
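To make the S4 idea concrete without depending on Bioconductor, here is a minimal sketch. The class name DocumentedData and its slots are hypothetical and are not part of ExpressionSet or MIAME; the validity check is one possible design choice.

```r
library(methods)

# Hypothetical S4 class that keeps a data frame and its codebook together
setClass(
  "DocumentedData",
  slots = c(data = "data.frame", codebook = "data.frame"),
  validity = function(object) {
    # Every column of the data must have a row in the codebook
    undocumented <- setdiff(names(object@data), object@codebook$variable.name)
    if (length(undocumented) > 0) {
      paste("No codebook entry for:", paste(undocumented, collapse = ", "))
    } else {
      TRUE
    }
  }
)

dd <- new(
  "DocumentedData",
  data = data.frame(id = 1:3, year = 2000:2002),
  codebook = data.frame(
    variable.name = c("id", "year"),
    label = c("Unique id", "Calendar year"),
    stringsAsFactors = FALSE
  )
)

# Look up the label of one variable
dd@codebook$label[dd@codebook$variable.name == "year"]
```

The validity function makes object creation fail when a variable is undocumented, which is one way to keep data and codebook from drifting apart.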
The comment() function might be useful here. It can set and query a comment attribute on an object, and it has the advantage over other attributes that the comment is not printed.
Merging, however, loses the comment on the data frame dat, so those sorts of operations need to be handled explicitly. To truly do what you want, you will probably either need to write special versions of the functions you use, so that they maintain the comments/metadata during extraction and merge operations, or look into producing your own class of objects, say a list with a data frame and other components holding the metadata, and then write methods for the functions you need that preserve the metadata.
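A minimal sketch of the comment() behaviour described above. The data frames dat and other, and the comment texts, are made up for illustration and are not the original answer's code.

```r
# Hypothetical example data
dat <- data.frame(id = 1:3, income = c(10.5, 20.1, 15.3))

# Comments can be set on a single column and on the whole data frame
comment(dat$income) <- "Income per capita, thousands of US$"
comment(dat) <- "Income data; source unknown"

# Printing shows only the data, not the comments
print(dat)

# Query the comments explicitly
comment(dat)
comment(dat$income)

# Merge with a second (hypothetical) data frame and query again
other  <- data.frame(id = 1:3, year = 2000:2002)
merged <- merge(dat, other, by = "id")
comment(merged)
```

The last call is the case the answer warns about: the merged result is a new object, so the comment has to be copied over by hand if it should survive.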
An example along these lines is the zoo package, which creates an object for a time series with extra components holding the ordering and time/date information, but which still works like a normal object from the point of view of subsetting etc., because the authors have provided methods for functions like [ and others.
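The "own class plus methods" idea can be sketched with a small S3 class. Everything here is hypothetical (the constructor documented, the class name, and the codebook column variable.name); it is not how zoo is implemented, only an illustration of a [ method that keeps metadata in sync.

```r
# Hypothetical constructor: a list holding a data frame and its codebook
documented <- function(data, codebook) {
  stopifnot(is.data.frame(data), is.data.frame(codebook))
  structure(list(data = data, codebook = codebook), class = "documented")
}

# Subsetting method: subset the data, then keep only the codebook rows
# for the variables that are still present
`[.documented` <- function(x, i, j, ...) {
  new_data <- x$data[i, j, drop = FALSE]
  keep <- x$codebook$variable.name %in% names(new_data)
  documented(new_data, x$codebook[keep, , drop = FALSE])
}

dd <- documented(
  data = data.frame(id = 1:3, year = 2000:2002, income = c(10.5, 20.1, 15.3)),
  codebook = data.frame(
    variable.name = c("id", "year", "income"),
    label = c("Unique id", "Calendar year", "Income per capita"),
    stringsAsFactors = FALSE
  )
)

# Select two columns; the codebook follows along
sub <- dd[, c("id", "year")]
sub$codebook
```

A merge method could follow the same pattern, combining the two codebooks with rbind and dropping duplicated variable names.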
As of 2020, there are R packages directly dedicated to codebooks that may fit your needs.
The codebook package is a comprehensive package that can generate codebooks (with common attributes plus descriptive statistics) in different formats. It has a website and a paper (Arslan, 2019, How to Automatically Document Data With the codebook Package to Facilitate Data Reuse). Figure 1 of the paper also compares different approaches.
Here is an example.
The dataspice package (featured by rOpenSci) is particularly dedicated to generating metadata that can be found by search engines on the web. It has a website.
Here is an example.
The dataMaid package can generate a report containing metadata and descriptive statistics, and it can perform certain checks. It's on CRAN and GitHub, and it has a JSS paper (Petersen & Ekstrøm, 2019, dataMaid: Your Assistant for Documenting Supervised Data Quality Screening in R).
Here is an example.
The memisc package has a lot of functionality for working with survey data and also comes with a codebook function. It has a website.
Here is an example.
There is also a blog post by Marta Kołczyńska with a lightweight function that generates a data frame with metadata (which can be exported, e.g., to an Excel file).
Here is an example.
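To give a rough idea of what such a lightweight function can look like, here is a base-R sketch. It is not the function from the blog post or from any of the packages above; make_codebook is a hypothetical name, and reading labels from a "label" attribute on each column is an assumption.

```r
# Hypothetical helper: one codebook row per column of a data frame
make_codebook <- function(df) {
  data.frame(
    variable.name = names(df),
    class = vapply(df, function(col) class(col)[1], character(1)),
    # Assumption: labels, if any, are stored in a "label" attribute
    label = vapply(df, function(col) {
      lab <- attr(col, "label", exact = TRUE)
      if (is.null(lab)) NA_character_ else as.character(lab)[1]
    }, character(1)),
    n_missing = vapply(df, function(col) sum(is.na(col)), integer(1)),
    n_unique  = vapply(df, function(col) length(unique(col)), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Hypothetical example data with one labelled column
df <- data.frame(id = 1:3, income = c(10.5, NA, 15.3))
attr(df$income, "label") <- "Income per capita"

cb <- make_codebook(df)
print(cb)

# The result is an ordinary data frame, so it can be exported as usual
write.csv(cb, file.path(tempdir(), "codebook.csv"), row.names = FALSE)
```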
How I do this is a little different and markedly less technical. I generally follow the guiding principle that if text is not designed to be meaningful to the computer and only meaningful to humans, then it belongs in comments in the source code.
This may feel rather "low tech" but there are some good reasons to do this:
Obviously, there are some real advantages to carrying metadata along with the objects. If your workflow makes the above points less germane, then it may make a lot of sense to attach metadata to your data structure. My intent was only to share some reasons why a "lower tech", comment-based approach might be considered.