I'm working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main "original_data" directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.
Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.
This seems like something that ProjectTemplate would have already, but I can't seem to find that functionality there.
Does such a tool exist?
If it does not, what considerations should be taken into account to provide maximum flexibility? Some preliminary thoughts:
1. A markup language should be used (YAML?); a rough sketch of such a descriptor file follows this list.
2. All sub-directories should be scanned.
3. To facilitate (2), a standard extension for a dataset descriptor should be used.
4. Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.
5. Ideally the report would be templated as well (e.g. "We pulled the [var] variable from [dset] dataset on [date]."), and possibly linked to Sweave.
6. The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.
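For concreteness, here is a rough sketch of what I imagine one of these descriptor files could look like. Every field name (url, retrieved, variables, source_name, final_name) and all sample values are invented for illustration, not an existing convention:

```yaml
# original_data/census_income/dataset.yml  (hypothetical sub-directory and file name)
dataset: census_income
url: http://example.org/census/income.csv   # placeholder URL
retrieved: 2011-01-01                       # placeholder date
variables:
  - source_name: HINC_EST          # name as it appears in the raw file
    final_name: median_income      # name it takes on after the cleaning step
    description: Median household income, 2009 dollars
  - source_name: GEOID
    final_name: county_fips
    description: Five-digit county FIPS code
```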
This is a very good question: people should be very concerned about all of the sequences of data collection, aggregation, transformation, etc., that form the basis for statistical results. Unfortunately, this is not widely practiced.
Before addressing your questions, I want to emphasize that this appears quite related to the general aim of managing data provenance. I might as well give you a Google link to read more. :) You'll find a bunch of resources, such as surveys, software tools (e.g. some listed in the Wikipedia entry), various research projects (e.g. the Provenance Challenge), and more.
That's a conceptual start, now to address practical issues:
Welcome to everyone's nightmare. :)
No problem.
list.files(..., recursive = TRUE) might become a good friend; see also listDirectory() in R.utils.
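A minimal sketch of that scan (the "original_data" root and the ".yml" extension are carried over from the question as assumptions; the helper name is invented):

```r
# Sketch only: collect every descriptor file under original_data/.
# Use whatever extension you standardize on; ".yml" is assumed here.
find_descriptors <- function(root = "original_data", ext = "\\.yml$") {
  list.files(root, pattern = ext, recursive = TRUE, full.names = TRUE)
}

find_descriptors()
# e.g. "original_data/census_income/dataset.yml", one per sub-directory
```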
It's worth noting that filling in a methods section on data sources is a narrow application within data provenance. In fact, it's rather unfortunate that the CRAN Task View on Reproducible Research focuses only on documentation. The aims of data provenance are, in my experience, a subset of reproducible research, and documentation of data manipulation and results are a subset of data provenance. Thus, this task view is still in its infancy regarding reproducible research. It might be useful for your aims, but you'll eventually outgrow it. :)
Yes. What are such tools? Mon dieu... such tooling is very application-centric in general. Within R, I think that these tools are not given much attention (* see below). That's rather unfortunate - either I'm missing something, or else the R community is missing something that we should be using.
For the basic process that you've described, I typically use JSON (see this answer and this answer for comments on what I'm up to). For much of my work, I represent this as a "data flow model" (that term can be ambiguous, by the way, especially in the context of computing, but I mean it from a statistical analyses perspective). In many cases, this flow is described via JSON, so it is not hard to extract the sequence from JSON to address how particular results arose.
For more complex or regulated projects, JSON is not enough, and I use databases to define how data was collected, transformed, etc. For regulated projects, the database may have lots of authentication, logging, and more in it, to ensure that data provenance is well documented. I suspect that that kind of DB is well beyond your interest, so let's move on...
Frankly, whatever you need to describe your data flow will be adequate. Most of the time, I find it adequate to have good JSON, good data directory layouts, and good sequencing of scripts.
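As a sketch of what "good JSON" plus a good directory layout can give you (jsonlite is one way to read the files back in; the layout and field names below are assumptions, not part of any standard):

```r
library(jsonlite)

# Read every JSON descriptor found under original_data/ into one list.
# Field names such as "dataset", "url", and "retrieved" are illustrative.
descriptor_files <- list.files("original_data", pattern = "\\.json$",
                               recursive = TRUE, full.names = TRUE)
descriptors <- lapply(descriptor_files, fromJSON)

# Extracting the sequence for a methods section is then ordinary list
# manipulation, e.g. pulling out each source URL and retrieval date:
sources <- data.frame(
  dataset   = vapply(descriptors, function(d) d$dataset,   character(1)),
  url       = vapply(descriptors, function(d) d$url,       character(1)),
  retrieved = vapply(descriptors, function(d) d$retrieved, character(1))
)
```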
Done: listDirectory().
Trivial: ".json". ;-) Or ".SecretSauce" works, too.
As stated, this doesn't quite make sense. Suppose that I take var1 and var2, and create var3 and var4. Perhaps var4 is just a mapping of var2 to its quantiles and var3 is the observation-wise maximum of var1 and var2; or I might create var4 from var2 by truncating extreme values. If I do so, do I retain the name of var2? On the other hand, if you're referring to simply matching "long names" with "simple names" (i.e. text descriptors to R variables), then this is something only you can do. If you have very structured data, it's not hard to create a list of text names matching variable names; alternatively, you could create tokens upon which string substitution could be performed. I don't think it's hard to create a CSV (or, better yet, JSON ;-)) that matches variable name to descriptor. Simply keep checking that all variables have matching descriptor strings, and stop once that's done. That's where others' suggestions of roxygen and roxygen2 can apply.
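If it's only that simpler matching you're after, here is a small sketch of the idea (the JSON fragment, field names, and the cleaned_data stand-in are all invented for illustration):

```r
library(jsonlite)

# Hypothetical fragment of a descriptor file mapping final variable names
# to text descriptors.
var_doc <- fromJSON('{
  "median_income": "Median household income, 2009 dollars",
  "county_fips":   "Five-digit county FIPS code"
}')

# Toy stand-in for whatever data frame the cleaning step actually produces.
cleaned_data <- data.frame(median_income = numeric(0),
                           county_fips   = character(0))

# Keep checking that all variables have a matching descriptor string,
# and stop once that's done.
undocumented <- setdiff(names(cleaned_data), names(var_doc))
if (length(undocumented) > 0)
  warning("No descriptor for: ", paste(undocumented, collapse = ", "))
```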
Hmm, I'm stumped here. :)
(*) By the way, if you want one FOSS project that relates to this, check out Taverna. It has been integrated with R as documented in several places. This may be overkill for your needs at this time, but it's worth investigating as an example of a decently mature workflow system.
Note 1: Because I frequently use bigmemory for large data sets, I have to name the columns of each matrix. These are stored in a descriptor file for each binary file. That process encourages the creation of descriptors matching variable names (and matrices) to descriptions. If you store your data in a database or other external files supporting random access and multiple R/W access (e.g. memory mapped files, HDF5 files, anything but .rdat files), you will likely find that adding descriptors becomes second nature.
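For what it's worth, a rough sketch of that bigmemory pattern (argument names from memory; check ?filebacked.big.matrix before relying on them, and the file names here are made up):

```r
library(bigmemory)

# File-backed matrix whose column names are stored, along with the backing
# file's metadata, in a descriptor file on disk.
x <- filebacked.big.matrix(nrow = 1000, ncol = 2, type = "double",
                           backingfile    = "income.bin",
                           descriptorfile = "income.desc",
                           dimnames = list(NULL, c("median_income", "pct_poverty")))

# Later sessions re-attach via the descriptor file, so the column names
# (and the habit of keeping descriptors) travel with the data.
y <- attach.big.matrix("income.desc")
colnames(y)
```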