How to create, structure, maintain and update data codebooks in R?

Posted on 2024-10-22 05:42:49

In the interest of replication I like to keep a codebook with metadata for each data frame. A data codebook is:

a written or computerized list that provides a clear and comprehensive description of the variables that will be included in the database. Marczyk et al. (2010)

I like to document the following attributes of a variable:

name
description (label, format, scale, etc)
source (e.g. World Bank)
source media (URL and date accessed, CD and ISBN, or whatever)
file name of the source data on disk (helps when merging codebooks)
notes

For example, this is what I am implementing to document the variables in data frame mydata1 with 8 variables:

# One row per variable of mydata1 (which has 8 variables)
code.book.mydata1 <- data.frame(
  variable.name = names(mydata1),
  label = c("Label 1",
            "State name",
            "Personal identifier",
            "Income per capita, thousands of US$, constant year 2000 prices",
            "Unique id",
            "Calendar year",
            "blah",
            "bah"),
  source       = rep("unknown", ncol(mydata1)),
  source_media = rep("unknown", ncol(mydata1)),
  filename     = rep("unknown", ncol(mydata1)),
  notes        = rep("unknown", ncol(mydata1))
)

I write a different codebook for each data set I read. When I merge data frames I also merge the relevant parts of their associated codebooks, to document the final database. I do this essentially by copying and pasting the code above and changing the arguments.
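
One way to avoid the copy-and-paste step is to wrap the skeleton in a small function. The following is only a sketch under stated assumptions: make_codebook is a hypothetical helper name (not from any package), "unknown" is assumed to be an acceptable placeholder, and the built-in cars and women data sets stand in for mydata1.

```r
# Hypothetical helper: build a codebook skeleton for any data frame.
make_codebook <- function(df, label = NA_character_, source = "unknown",
                          source_media = "unknown", filename = "unknown",
                          notes = "") {
  n <- ncol(df)
  data.frame(
    variable.name = names(df),
    label         = rep_len(label, n),
    source        = rep_len(source, n),
    source_media  = rep_len(source_media, n),
    filename      = rep_len(filename, n),
    notes         = rep_len(notes, n),
    stringsAsFactors = FALSE
  )
}

# Built-in data sets used as stand-ins; the file name is made up
cb_cars  <- make_codebook(cars,
                          label = c("Speed (mph)", "Stopping distance (ft)"),
                          filename = "cars.csv")
cb_women <- make_codebook(women, label = c("Height (in)", "Weight (lbs)"))

# After merging the data, stack the codebooks and drop repeated variables
cb_all <- rbind(cb_cars, cb_women)
cb_all <- cb_all[!duplicated(cb_all$variable.name), ]
```

Keeping the codebook as an ordinary data frame means it can be merged, filtered and exported with the same tools as the data itself.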

洒一地阳光 2024-10-29 05:42:49

You could add any special attribute to any R object with the attr function. E.g.:

x <- cars
attr(x,"source") <- "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."

And see the given attribute in the structure of the object:

> str(x)
'data.frame':   50 obs. of  2 variables:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
 - attr(*, "source")= chr "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."

You can also retrieve the attribute with the same attr function:

> attr(x, "source")
[1] "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."

If you only add new cases to your data frame, the given attribute will not be affected (see: str(rbind(x,x))), while altering the structure will erase it (see: str(cbind(x,x))).
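
Since the attribute can disappear silently, it may be worth checking for it after structural changes and copying it back by hand. A minimal sketch, continuing the cars example (the variable y is introduced here only for illustration):

```r
x <- cars
attr(x, "source") <- "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."

# Column-binding builds a new data frame, so the custom attribute is dropped
y <- cbind(x, x)
lost <- is.null(attr(y, "source"))

# Copy the attribute back explicitly if it went missing
if (lost) attr(y, "source") <- attr(x, "source")
```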

UPDATE: based on comments

If you want to list all non-standard attributes, check the following:

setdiff(names(attributes(x)),c("names","row.names","class"))

This will list all non-standard attributes (standard are: names, row.names, class in data frames).

Based on that, you could write a short function to list all non-standard attributes and also the values. The following does work, though not in a neat way... You could improve it and make up a function :)

First, define the unique (= non-standard) attributes:

uniqueattrs <- setdiff(names(attributes(x)),c("names","row.names","class"))

And make a matrix which will hold the names and values:

attribs <- matrix(0,0,2)

Loop through the non-standard attributes and save in the matrix the names and values:

# seq_along() avoids a spurious iteration when there are no custom attributes
for (i in seq_along(uniqueattrs)) {
    attribs <- rbind(attribs, c(uniqueattrs[i], attr(x, uniqueattrs[i])))
}

Convert the matrix to a data frame and name the columns:

attribs <- as.data.frame(attribs)
names(attribs) <- c('name', 'value')

And save in any format, e.g.:

write.csv(attribs, 'foo.csv')
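
The steps above could be folded into one function along these lines. This is a sketch, not a polished implementation: list_custom_attrs is a hypothetical name, the "units" attribute is invented for the example, and multi-element attribute values are simply collapsed into one string.

```r
# Hypothetical helper: collect the non-standard attributes of an object
# in a two-column data frame (name, value).
list_custom_attrs <- function(x, standard = c("names", "row.names", "class")) {
  nms <- setdiff(names(attributes(x)), standard)
  # Collapse each value to a single string so that vectors fit in one cell
  vals <- vapply(
    nms,
    function(n) paste(as.character(attr(x, n, exact = TRUE)), collapse = "; "),
    character(1)
  )
  data.frame(name = nms, value = unname(vals), stringsAsFactors = FALSE)
}

x <- cars
attr(x, "source") <- "Ezekiel (1930)"
attr(x, "units")  <- c("mph", "ft")   # made-up attribute for illustration

attribs <- list_custom_attrs(x)
write.csv(attribs, file.path(tempdir(), "foo.csv"), row.names = FALSE)
```

Unlike the loop, this also returns an empty data frame cleanly when the object has no custom attributes.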

As for your question about variable labels, check the read.spss function from the foreign package, as it does what you need: it stores the variable labels in an attribute. The main idea is that an attribute can itself be a data frame or another object, so you do not need a separate attribute for every variable; make just one (e.g. named "variable.labels") and keep all the information there. You can then look up a label with attr(x, "variable.labels")['foo'], where 'foo' stands for the required variable name. See the function cited above and the attributes of the imported data frames for more details.
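
As a small illustration of the single-attribute idea, here is a sketch that sets the labels by hand rather than importing them with read.spss. The label texts and the codebook variable are assumptions made for the example.

```r
x <- cars

# One named character vector holds the labels for all variables
attr(x, "variable.labels") <- c(speed = "Speed (mph)",
                                dist  = "Stopping distance (ft)")

# Look up a single label by variable name
dist_label <- attr(x, "variable.labels")["dist"]

# Turn the labels into a codebook-style data frame
labels <- attr(x, "variable.labels")
codebook <- data.frame(variable.name = names(x),
                       label = unname(labels[names(x)]),
                       stringsAsFactors = FALSE)
```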

I hope this helps you write the required functions in a much neater way than I tried above! :)

じее 2024-10-29 05:42:49

A more advanced version would be to use S4 classes. For example, in Bioconductor the ExpressionSet class is used to store microarray data together with its associated experimental metadata.

The MIAME object described in Section 4.4 looks very similar to what you are after:

experimentData <- new("MIAME", name = "Pierre Fermat",
          lab = "Francis Galton Lab", contact = "[email protected]",
          title = "Smoking-Cancer Experiment", abstract = "An example ExpressionSet",
          url = "www.lab.not.exist", other = list(notes = "Created from text files"))

梦萦几度 2024-10-29 05:42:49

The comment() function might be useful here. It can set and query a comment attribute on an object, and has the advantage over other, normal attributes of not being printed.

dat <- data.frame(A = 1:5, B = 1:5, C = 1:5)
comment(dat$A) <- "Label 1"
comment(dat$B) <- "Label 2"
comment(dat$C) <- "Label 3"
comment(dat) <- "data source is, sampled on 1-Jan-2011"

which gives:

> dat
  A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
> dat$A
[1] 1 2 3 4 5
> comment(dat$A)
[1] "Label 1"
> comment(dat)
[1] "data source is, sampled on 1-Jan-2011"

Example of merging:

> dat2 <- data.frame(D = 1:5)
> comment(dat2$D) <- "Label 4"
> dat3 <- cbind(dat, dat2)
> comment(dat3$D)
[1] "Label 4"

but that loses the comment on dat:

> comment(dat3)
NULL

so those sorts of operations would need handling explicitly. To truly do what you want, you'll probably need to write special versions of the functions you use, so that they maintain the comments/metadata during extraction/merge operations. Alternatively, you might want to look into producing your own class of objects, say a list with a data frame and other components holding the metadata, and then write methods that preserve the metadata for the functions you need.

An example along these lines is the zoo package, whose time series objects carry the ordering and time/date information along with the data, but still work like normal objects from the point of view of subsetting etc., because the authors have provided methods for functions like [.
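
The "own class" idea could look roughly like the sketch below. Everything here is hypothetical: the class name annotated_df, the helper cbind_annotated and the label texts are invented for illustration, and only a subsetting method is provided.

```r
# Hypothetical class: a list holding a data frame and its codebook
annotated_df <- function(data, codebook) {
  stopifnot(is.data.frame(data), is.data.frame(codebook),
            all(names(data) %in% codebook$variable.name))
  structure(list(data = data, codebook = codebook), class = "annotated_df")
}

# Subsetting method: keep the codebook rows in step with the columns kept
`[.annotated_df` <- function(x, i, j, ...) {
  if (missing(i)) i <- seq_len(nrow(x$data))
  if (missing(j)) j <- seq_len(ncol(x$data))
  d  <- x$data[i, j, drop = FALSE]
  cb <- x$codebook[match(names(d), x$codebook$variable.name), , drop = FALSE]
  annotated_df(d, cb)
}

# Column-bind two annotated objects and stack their codebooks
cbind_annotated <- function(a, b) {
  annotated_df(cbind(a$data, b$data), rbind(a$codebook, b$codebook))
}

a <- annotated_df(
  cars[1:5, ],
  data.frame(variable.name = c("speed", "dist"),
             label = c("Speed (mph)", "Stopping distance (ft)"),
             stringsAsFactors = FALSE)
)
w <- annotated_df(
  women[1:5, ],
  data.frame(variable.name = c("height", "weight"),
             label = c("Height (in)", "Weight (lbs)"),
             stringsAsFactors = FALSE)
)

b  <- a[1:3, "dist"]         # codebook shrinks to the "dist" row
ab <- cbind_annotated(a, w)  # codebook now documents all four variables
```

The cost of this design is that every operation you rely on (merge, rbind, printing and so on) needs its own method or wrapper before the metadata is reliably preserved.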

柏拉图鍀咏恒 2024-10-29 05:42:49

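As of 2020, there are R packages dedicated to codebooks that may suit your needs.

The codebook package is a comprehensive package that can generate codebooks (with common attributes and descriptive statistics) in different formats. It has a website and a paper (Arslan, 2019, How to automatically document data with the codebook package to facilitate data reuse). The paper also discusses the different approaches, as shown in its Figure 1.
Here is an example.
The dataspice package (by rOpenSci) is specifically devoted to generating metadata that can be found by web search engines. It has a website.
Here is an example.
The dataMaid package can generate reports containing metadata and descriptive statistics, and can perform certain checks. It is on CRAN and GitHub, and it has a JSS paper (Petersen & Ekstrøm, 2019, dataMaid: Your Assistant for Documenting Supervised Data Quality Screening in R).
Here is an example.
The memisc package has many functions for working with survey data, and it also comes with codebook functionality. It has a website.
Here is an example.
There is also a blog post by Marta Kołczyńska with a lightweight function that generates a data frame with metadata (which can be exported to an Excel file, etc.).
Here is an example.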

栀梦 2024-10-29 05:42:49

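How I do this is a little different and markedly less technical. I generally follow the guiding principle that if text is not designed to be meaningful to the computer, and is meaningful only to humans, then it belongs in comments in the source code.

This may feel rather "low tech", but there are some good reasons to do it:

When someone else picks up your code in the future, it is intuitive that the comments are unambiguously intended for them to read. Parameters set in unusual places within data structures may not be obvious to a future user.
Keeping track of parameters set inside abstract objects requires a fair bit of discipline. Writing code comments requires discipline as well, but the absence of a comment is immediately obvious. If descriptions are carried along as part of the object, glancing at the code does not make this obvious. The code then becomes less "literate" in the "literate programming" sense of the word.
Carrying descriptions of the data inside the data object can easily result in descriptions that are incorrect. This can happen if, for example, a column containing a measurement in kilograms is multiplied by 2.2 to convert the units to pounds. It is very easy to overlook the need to update the metadata.

Obviously, there are some real advantages to having metadata carried along with objects. If your workflow makes the points above less relevant, then it may make a lot of sense to create a metadata attachment for your data structure. My intent was only to share some reasons why a "lower tech", comment-based approach might be considered.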
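
A tiny sketch of the comment-based style, using the kilograms-to-pounds case. The patients data frame and its values are invented for illustration.

```r
# weight_kg: body weight in kilograms, as recorded in the source file
# (made-up values for illustration)
patients <- data.frame(id = 1:3, weight_kg = c(60, 72.5, 81))

# Convert to pounds. The unit lives in the column name and in this comment,
# so the change is documented right where it happens.
patients$weight_lb <- patients$weight_kg * 2.2   # approximate kg-to-lb factor
patients$weight_kg <- NULL                       # drop the old column
```

Had the unit been stored in an attribute of the column instead, nothing in these lines would signal that the attribute is now out of date.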
