When discussing performance with colleagues, teaching, sending a bug report or searching for guidance on mailing lists and here on Stack Overflow, a reproducible example is often asked for and always helpful.
What are your tips for creating an excellent example? How do you paste data structures from R in a text format? What other information should you include?
Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc.?
How does one make a great R reproducible example?
Basically, a minimal reproducible example (MRE) should enable others to exactly reproduce your issue on their machines.
Please do not post images of your data, code, or console output!
Brief summary
An MRE consists of the following items:
a minimal dataset, necessary to demonstrate the problem
the minimal runnable code necessary to reproduce the issue, which can be run on the given dataset
all necessary information on the libraries used, the R version, and the OS it is run on, perhaps a sessionInfo()
in the case of random processes, a seed (set by set.seed()) to enable others to replicate exactly the same results as you have
For examples of good MREs, see the "Examples" section at the bottom of the help page of the function you are using. Simply type e.g. help(mean), or in short ?mean, into your R console.
Providing a minimal dataset
Usually, sharing huge data sets is not necessary and may rather discourage others from reading your question. Therefore, it is better to use built-in datasets or create a small "toy" example that resembles your original data, which is actually what is meant by minimal. If for some reason you really need to share your original data, you should use a method, such as dput(), that allows others to get an exact copy of your data.
Built-in datasets
You can use one of the built-in datasets. A comprehensive list of built-in datasets can be seen with data(). There is a short description of every data set, and more information can be obtained, e.g. with ?iris, for the 'iris' data set that comes with R. Installed packages might contain additional datasets.
Creating example data sets
Preliminary note: Sometimes you may need special formats (i.e. classes), such as factors, dates, or time series. For these, make use of functions like as.factor, as.Date, as.xts, ...
Vectors
Matrices
Data frames
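The examples that belonged under these headings did not survive, so here is a rough sketch of the kinds of toy objects meant. The object names (d, x, m, dat) and all values are assumptions chosen for illustration; dat is given a type column whose values contain spaces so that it matches the later remarks about it.

```r
set.seed(42)                           # so the random values can be reproduced

d <- as.Date("2020-12-30")             # a special class: class(d) is "Date"

x <- rnorm(10)                         # vector: ten normally distributed values
m <- matrix(1:12, nrow = 3, ncol = 4)  # matrix: 3 rows, 4 columns

n <- 6
dat <- data.frame(                     # data frame with mixed column classes
  id    = 1:n,
  date  = seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), by = "day"),
  group = rep(LETTERS[1:2], n / 2),
  age   = sample(18:30, n, replace = TRUE),
  type  = factor(paste("type", 1:n)),  # values contain a space, e.g. "type 1"
  x     = rnorm(n)
)
str(dat)                               # shows the class of every column
```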
Note: Although it is widely used, it is better not to name your data frame df, because df() is an R function for the density (i.e. height of the curve at point x) of the F distribution and you might get a clash with it.
Copying original data
If you have a specific reason, or data that would be too difficult to construct an example from, you could provide a small subset of your original data, best by using dput.
Why use dput()?
dput prints to your console all the information needed to exactly reproduce your data. You may simply copy the output and paste it into your question.
Calling dat (from above) produces output that still lacks information about variable classes and other features if you share it in your question. Furthermore, the spaces in the type column make it difficult to do anything with it. Even when we set out to use the data, we won't manage to get important features of your data right.
Subset your data
To share a subset, use head(), subset() or the indices iris[1:4, ]. Then wrap it into dput() to give others something that can be put in R immediately.
Example
Console output to share in your question:
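The original example output is missing here. As a minimal sketch of the idea (the names small, code and rebuilt are illustrative), the snippet below prints the dput() text for four rows of iris and then checks that evaluating that text gives the data back:

```r
# Share only the first four rows of the built-in iris data set
small <- iris[1:4, ]
dput(small)   # prints R code that recreates `small`; paste this text into the question

# What a reader does with the pasted text: evaluate it to rebuild the object
code    <- paste(capture.output(dput(small)), collapse = "\n")
rebuilt <- eval(parse(text = code))
all.equal(small, rebuilt)   # compares the rebuilt object with the original
```

The printed text starts with structure(list(...)), and that is the part to paste after an assignment such as my_data <- in the question.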
When using dput, you may also want to include only relevant columns, e.g. dput(mtcars[1:3, c(2, 5, 6)])
Note: If your data frame has a factor with many levels, the dput output can be unwieldy because it will still list all the possible factor levels even if they aren't present in the subset of your data. To solve this issue, you can use the droplevels() function, e.g. dput(droplevels(iris[1:4, ])), after which Species is a factor with only one level.
One other caveat for dput is that it will not work for keyed data.table objects or for grouped tbl_df (class grouped_df) from the tidyverse. In these cases you can convert back to a regular data frame before sharing, dput(as.data.frame(my_data)).
Consider using the constructive package for cleaner results
Using constructive::construct(iris[1:4,]) instead of dput(iris[1:4,]) as above gives output that is a little bit more compact and easier to read (examples with, for example, long strings of repeated factor values will give an even stronger reason to use construct() ...)
Producing minimal code
Combined with the minimal data (see above), your code should exactly reproduce the problem on another machine by simply copying and pasting it.
This should be the easy part but often isn't. What you should not do:
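What you should do:
state which packages you use, if any (using library())
if you open connections or create files, add some code to close them or delete the files (using unlink())
if you change options, make sure the code contains a statement to revert them to the original ones (e.g. op <- par(mfrow=c(1,2)) ...some code... par(op))
Providing necessary information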
In most cases, just the R version and the operating system will suffice. When conflicts arise with packages, giving the output of sessionInfo() can really help. When talking about connections to other applications (be it through ODBC or anything else), one should also provide version numbers for those, and if possible, also the necessary information on the setup.
If you are running R in RStudio, using rstudioapi::versionInfo() can help report your RStudio version.
If you have a problem with a specific package, you may want to provide the package version by giving the output of packageVersion("name of the package").
Seed
Using set.seed() you may specify a seed (see the note below), i.e. the specific state in which R's random number generator is fixed. This makes it possible for random functions, such as sample(), rnorm(), runif() and lots of others, to always return the same result.
Note: Results obtained after set.seed() differ between R >= 3.6.0 and previous versions, because the default sampling method changed in R 3.6.0. Specify which R version you used for the random process, and don't be surprised if you get slightly different results when following old questions. To get the same result in such cases, you can use the RNGversion() function before set.seed() (e.g. RNGversion("3.5.2")).
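The effect of fixing the seed can be sketched as follows; the seed value 42 and the names a and b are arbitrary choices made for illustration.

```r
set.seed(42)      # fix the state of the random number generator
a <- rnorm(3)

set.seed(42)      # reset to the same state
b <- rnorm(3)

identical(a, b)   # the two draws are the same because the seed was the same
```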
(Here's my advice from How to write a reproducible example. I've tried to make it short but sweet. Section 9.2 of "Workflow: Getting help" in r4ds is a more recent take that also discusses the reprex package.)
How to write a reproducible example
You are most likely to get good help with your R problem if you provide a reproducible example. A reproducible example allows someone else to recreate your problem by just copying and pasting R code.
You need to include four things to make your example reproducible: required packages, data, code, and a description of your R environment.
Packages should be loaded at the top of the script, so it's easy to
see which ones the example needs.
The easiest way to include data in an email or Stack Overflow question is to use dput() to generate the R code to recreate it. For example, to recreate the mtcars dataset in R, I'd perform the following steps:
Run dput(mtcars) in R
Copy the output
In my reproducible script, type mtcars <- then paste.
Spend a little bit of time ensuring that your code is easy for others to read:
Make sure you've used spaces and your variable names are concise, but
informative
Use comments to indicate where your problem lies
Do your best to remove everything that is not related to the problem.
The shorter your code is, the easier it is to understand.
Include the output of sessionInfo() in a comment in your code. This summarises your R environment and makes it easy to check if you're using an out-of-date package.
You can check you have actually made a reproducible example by starting up a fresh R session and pasting your script in.
Before putting all of your code in an email, consider putting it in a GitHub Gist. It will give your code nice syntax highlighting, and you don't have to worry about anything getting mangled by the email system.
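Putting the four ingredients together, a reproducible script could be laid out roughly like this. It is only a sketch: the object names cars_small and fit and the choice of lm() are assumptions made for illustration, and the data is the kind of text dput() prints for three rows and two columns of mtcars.

```r
# 1. Packages: library() calls go at the top (this sketch needs base R only)

# 2. Data: recreated from pasted dput() output
cars_small <- structure(
  list(mpg = c(21, 21, 22.8), cyl = c(6, 6, 4)),
  row.names = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710"),
  class = "data.frame"
)

# 3. Code: only what is needed to show the problem
fit <- lm(mpg ~ cyl, data = cars_small)
coef(fit)

# 4. Environment: run sessionInfo() and paste its output here as comments
```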
Personally, I prefer "one" liners. Something along the lines:
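The snippet that followed is missing, so this is only a sketch of such a compact definition. The names my.df and my.list and the column contents are illustrative assumptions, chosen so that they are unlikely to overwrite a reader's own objects.

```r
# One data frame with numeric, factor, character and logical columns
my.df <- data.frame(col1 = sample(c(1, 2), 10, replace = TRUE),
                    col2 = as.factor(sample(10)),
                    col3 = letters[1:10],
                    col4 = sample(c(TRUE, FALSE), 10, replace = TRUE))

# A list built from it, for problems that involve nested structures
my.list <- list(list1 = my.df, list2 = my.df[3], list3 = letters)
```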
The data structure should mimic the idea of the writer's problem and not the exact verbatim structure. I really appreciate it when variables don't overwrite my own variables or, god forbid, functions (like df).
Alternatively, one could cut a few corners and point to a pre-existing data set, such as one that ships with R or with the package in question.
Don't forget to mention any special packages you might be using.
If you're trying to demonstrate something on larger objects, you can try
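A sketch of what generating a larger object could look like; the name big and the size of one million rows are arbitrary choices for illustration.

```r
set.seed(1)   # make the random columns reproducible
big <- data.frame(a = sample(1e6),
                  b = sample(letters, 1e6, replace = TRUE))
dim(big)      # number of rows and columns
```

Generating the object with code keeps the question short even when the problem only shows up at scale.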
If you're working with spatial data via the raster package, you can generate some random data. A lot of examples can be found in the package vignette.
If you need some spatial object as implemented in sp, you can get some datasets via external files (like ESRI shapefile) in "spatial" packages (see the Spatial view in Task Views).
Inspired by this very post, I now use a handy function, reproduce(<mydata>), when I need to post to Stack Overflow.
Quick instructions
If myData is the name of your object to reproduce, run the following in R:
Details:
This function is an intelligent wrapper to dput and does the following:
creates a dput output
prepends objName <- ... to it, so that it can be easily copy+pasted, but...
The source is available here:
Example:
DF is about 100 x 102. I want to sample 10 rows and a few specific columns
Gives the following output:
Notice also that the entirety of the output is in a nice single, long line, not a tall paragraph of chopped up lines.
This makes it easier to read on Stack Overflow questions posts and also easier to copy+paste.
Update Oct 2013:
You can now specify how many lines the text output will take up (i.e., what you will paste into Stack Overflow). Use the lines.out=n argument for this. Example:
reproduce(DF, cols=c(1:3, 17, 23), lines.out=7)
yields:
Here is a good guide.
The most important point is: Make a small piece of code that we can run to see what the problem is. A useful function for this is dput(), but if you have very large data, then you might want to make a small sample dataset or only use the first 10 lines or so.
EDIT:
Also, make sure that you have identified where the problem is yourself. The example should not be an entire R script with "On line 200 there is an error". If you use the debugging tools in R (I love browser()) and Google, then you should be able to really identify where the problem is and reproduce a trivial example in which the same thing goes wrong.
The R-help mailing list has a posting guide which covers both asking and answering questions, including an example of generating data:
The word small is especially important. You should be aiming for a minimal reproducible example, which means that the data and the code should be as simple as possible to explain the problem.
EDIT: Pretty code is easier to read than ugly code. Use a style guide.
Since R 2.14 (I guess) you can feed your data text representation directly to read.table:
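A minimal sketch of this, using the text argument of read.table; the name dat and the values are made up for illustration.

```r
# The data is written as plain text inside the script itself
dat <- read.table(header = TRUE, text = "
  a b
  1 x
  2 y
  3 z
")
str(dat)   # check that columns and types were read as intended
```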
Sometimes the problem really isn't reproducible with a smaller piece of data, no matter how hard you try, and doesn't happen with synthetic data (although it's useful to show how you produced synthetic data sets that did not reproduce the problem, because it rules out some hypotheses).
Fitted model objects often have a simulate() method to generate new sample data simulated from the estimated parameters.
The simulate() (lme4) or simulate_new() (glmmTMB) functions can simulate responses given predictor variables and parameters.
The glmmTMB package also has a vignette that describes the simulation process in more detail.
If you can't do any of these then you probably need to hire a consultant to solve your problem ...
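For the simulation route mentioned above, here is a minimal base R sketch that uses the simulate() method for lm fits rather than the lme4 or glmmTMB functions named in the text. The model mpg ~ wt and the names fit, sim and fake are assumptions chosen for illustration.

```r
# Fit a model to the data that cannot be shared (mtcars stands in for it here)
fit <- lm(mpg ~ wt, data = mtcars)

# Simulate one new response vector from the fitted model
set.seed(101)
sim <- simulate(fit, nsim = 1)

# Shareable stand-in: same predictor values, simulated response
fake <- data.frame(wt = mtcars$wt, mpg = sim$sim_1)
head(fake)
```

If the predictor values are sensitive as well, they would need to be replaced or scrambled too; this sketch only replaces the response.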
edit: Two useful SO questions for anonymization/scrambling:
The answers so far are obviously great for the reproducibility part. This is merely to clarify that a reproducible example cannot and should not be the sole component of a question. Don't forget to explain what you want it to look like and the contours of your problem, not just how you have attempted to get there so far. Code is not enough; you need words also.
Here's a reproducible example of what to avoid doing (drawn from a real example, names changed to protect the innocent):
The following is sample data and part of function I have trouble with.
How can I achieve this ?
I have a very easy and efficient way to make an R example that has not been mentioned above.
You can define your structure first. For example,
Then you can input your data manually. This works better for small examples than for big ones.
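As a sketch of defining the structure first and then entering rows by hand (the name toy, the column names and the values are illustrative assumptions; an interactive editor could be used for the second step instead):

```r
# Step 1: an empty data frame that fixes column names and types
toy <- data.frame(id = integer(0), group = character(0), value = numeric(0))

# Step 2: add a few rows manually
toy <- rbind(toy, data.frame(id = 1L, group = "a", value = 0.5))
toy <- rbind(toy, data.frame(id = 2L, group = "b", value = 1.5))
toy
```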
Guidelines:
Your main objective in crafting your questions should be to make it as easy as possible for readers to understand and reproduce your problem on their systems. To do so:
Provide input data
Provide expected output
Explain your problem succinctly
This does take some work, but it seems like a fair trade-off since you ask others to do work for you.
Providing Data:
Built-in Data Sets
The best option by far is to rely on built-in datasets. This makes it very easy for others to work on your problem. Type data() at the R prompt to see what data is available to you. Some classic examples:
iris
mtcars
ggplot2::diamonds (external package, but almost everyone has it)
Inspect the built-in datasets to find one suitable for your problem.
If you can rephrase your problem to use the built-in datasets, you are much more likely to get good answers (and upvotes).
Self Generated Data
If your problem is specific to a type of data that is not represented in the existing data sets, then provide the R code that generates the smallest possible dataset that your problem manifests itself on. For example
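The two lines below are a sketch of such generating code; the name myData and the column contents are illustrative.

```r
set.seed(1)  # important: makes the random data reproducible
myData <- data.frame(a = sample(letters[1:5], 20, replace = TRUE), b = runif(20))
```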
Someone trying to answer my question can copy/paste those two lines and start working on the problem immediately.
dput
As a last resort, you can use dput to transform a data object to R code (e.g. dput(myData)). I say as a "last resort" because the output of dput is often fairly unwieldy, annoying to copy-paste, and obscures the rest of your question.
Provide Expected Output:
Someone once said:
If you can add something like "I expected to get this result:", followed by that result, to your question, people are much more likely to quickly understand what you are trying to do. If your expected result is large and unwieldy, then you probably haven't thought enough about how to simplify your problem (see next).
Explain Your Problem Succinctly
The main thing to do is simplify your problem as much as possible before you ask your question. Re-framing the problem to work with the built-in datasets will help a lot in this regard. You will also often find that just by going through the process of simplification, you will answer your own problem.
Here are some examples of good questions:
In both cases, the user's problems are almost certainly not with the simple examples they provide. Rather they abstracted the nature of their problem and applied it to a simple data set to ask their question.
Why Yet Another Answer To This Question?
This answer focuses on what I think is the best practice: use built-in data sets and provide what you expect as a result in a minimal form. The most prominent answers focus on other aspects. I don't expect this answer to rise to any prominence; this is here solely so that I can link to it in comments to newbie questions.
To quickly create a dput of your data you can just copy (a piece of) the data to your clipboard and run the following in R:
For data in Excel:
For data in a .txt file:
You can change the sep in the latter if necessary.
This will only work if your data is in the clipboard of course.
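A minimal sketch of the clipboard approach. So that it runs anywhere, a literal string (copied, an assumed name) stands in for what a spreadsheet would put on the clipboard; the commented line shows the direct clipboard call, which is available on Windows.

```r
# Tab-separated text, as it arrives when cells are copied from a spreadsheet
copied <- "id\tscore\n1\t0.5\n2\t0.7"

dat <- read.table(text = copied, sep = "\t", header = TRUE)
dput(dat)   # paste this output into the question

# On Windows the clipboard can be read directly instead:
# dput(read.table("clipboard", sep = "\t", header = TRUE))
```

For whitespace-separated text from a .txt file, sep = "" (the read.table default) would take the place of sep = "\t".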
Reproducible code is key to getting help. However, many users might be sceptical of pasting even a chunk of their data. For instance, they could be working with sensitive data or with original data collected for a research paper.
For whatever reason, I thought it would be nice to have a handy function for "deforming" my data before pasting it publicly. The anonymize function from the package SciencesPo is very silly, but for me it works nicely with the dput function.
Then I anonymize it:
One may also want to sample a few variables instead of the whole data before applying the anonymization and dput command.
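The original calls are missing, and the sketch below does not use SciencesPo. It is a hypothetical base R helper, here called scramble(), that only hides column names and category labels, to illustrate the general idea of deforming data before passing it to dput.

```r
# Hypothetical helper: replaces column names and factor/character labels
# with neutral ones. Numeric values are left unchanged.
scramble <- function(d) {
  names(d) <- paste0("V", seq_along(d))
  d[] <- lapply(d, function(col) {
    if (is.factor(col) || is.character(col)) {
      f <- factor(col)                                    # also drops unused levels
      levels(f) <- paste0("level", seq_along(levels(f)))  # neutral labels
      f
    } else {
      col
    }
  })
  d
}

anon <- scramble(iris[c(1, 51, 101), ])   # a few sampled rows
dput(anon)
```

Because the numeric values pass through untouched, this is not enough for data where the numbers themselves are sensitive.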
Often you need some data for an example, but you don't want to post your exact data. To use an existing data.frame from an established package, use the data command to import it,
e.g. data(mtcars),
and then demonstrate the problem on it.
If you have a large dataset which cannot easily be put into the script using dput(), post your data to pastebin and load it using read.table, which accepts a URL as its file argument.
Inspired by Henrik.
I am developing the wakefield package to address this need to quickly share reproducible data. Sometimes dput works fine for smaller data sets, but many of the problems we deal with are much larger, and sharing such a large data set via dput is impractical.
About:
wakefield allows the user to share minimal code to reproduce data. The user sets n (number of rows) and specifies any number of preset variable functions (there are currently 70) that mimic real data (things like gender, age, income etc.)
Installation:
Currently (2015-06-11), wakefield is a GitHub package but will go to CRAN eventually after unit tests are written. To install quickly, use:
Example:
Here is an example:
This produces:
If you have one or more factor variable(s) in your data that you want to make reproducible with dput(head(mydata)), consider adding droplevels to it, so that levels of factors that are not present in the minimized data set are not included in your dput output, in order to make the example minimal:
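A minimal sketch using the built-in iris data in place of mydata (the name small is illustrative):

```r
small <- head(iris)        # six rows, all of the same species
nlevels(small$Species)     # 3: the unused levels are still attached

dput(droplevels(small))    # the output lists only the level that is present
```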
The original post referred to the now retired r-fiddle service from datacamp. It has been rebranded as datacamp light and cannot be embedded as easily as indicated by my answer.
Please do not paste your console outputs like this:
We can not copy-paste it directly.
To make questions and answers properly reproducible, try to remove + and > before posting it and put # in front of outputs and comments like this:
One more thing, if you have used any function from a certain package, mention that library.
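To illustrate the point about console prompts, here is a small made-up example showing both forms. The first is commented out because R cannot run it as pasted.

```r
# Hard to reuse: copied straight from the console, prompts included
# > x <- c(1, 2, 3)
# > mean(x)
# [1] 2

# Easy to reuse: prompts removed, output kept as a comment
x <- c(1, 2, 3)
mean(x)
# [1] 2
```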
You can do this using reprex.
As mt1022 noted, "... good package for producing minimal, reproducible example is "reprex" from tidyverse".
According to Tidyverse:
An example is given on tidyverse web site.
I think this is the simplest way to create a reproducible example.
Apart from all the above answers, which I found very interesting, it could sometimes be very easy, as discussed here: How to make a minimal reproducible example to get help with R
There are many ways to make a random vector (see Create a 100 number vector with random values in R rounded to 2 decimals) or a random matrix in R:
Note that sometimes it is very difficult to share given data for various reasons, such as its dimensions. However, all the above answers are great, and they are very important to think about and use when one wants to make a reproducible data example. But note that in order to make data as representative as the original (in case the OP cannot share the original data), it is good to add some information with the data example, such as the following (if we call the data mydf1)
Moreover, one should know the type, length and attributes of the data, which can be one of several data structures
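As a sketch of the points above: a random vector, a random matrix called mydf1 as in the text, and the calls that describe its type and size. The seed, the sizes and the name v are arbitrary choices for illustration.

```r
set.seed(123)
v <- round(runif(100), 2)                        # 100 random values, 2 decimals
mydf1 <- matrix(rnorm(20), nrow = 4, ncol = 5)   # random 4 x 5 matrix

class(mydf1)        # the type of object you have
dim(mydf1)          # its dimensions
str(mydf1)          # compact overview of its structure
attributes(mydf1)   # any further attributes
```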
Here are some of my suggestions:
Share your data with dput, so others can help you more easily
Do not use install.packages() unless it is really necessary; people will understand if you just use require or library
Try to be concise
All these are part of a reproducible example.
It's a good idea to use functions from the testthat package to show what you expect to occur. Thus, other people can alter your code until it runs without error. This eases the burden of those who would like to help you, because it means they don't have to decode your textual description. For example, an expectation written as code is clearer than "I think x would come out to be 1.23 for y equal to or exceeding 10, and 3.21 otherwise, but I got neither result". Even in this silly example, I think the code is clearer than the words. Using testthat lets your helper focus on the code, which saves time, and it provides a way for them to know they have solved your problem, before they post it.
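A sketch of how the expectation quoted above could be written with testthat. It assumes the testthat package is installed; the values of y and x are hypothetical stand-ins for the asker's real computation and are chosen here so that the check passes.

```r
library(testthat)

# Hypothetical stand-ins for the code that defines x and y in the real question
y <- 12
x <- if (y >= 10) 1.23 else 3.21

# The expectation itself: readers can edit the code above until this passes
if (y >= 10) {
  expect_equal(x, 1.23)
} else {
  expect_equal(x, 3.21)
}
```

In a real question the code defining x would be the broken part, and the expectation would fail until someone fixes it.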