How to make a great R reproducible example

Posted on 2024-11-06 06:28:20

When discussing performance with colleagues, teaching, sending a bug report, or searching for guidance on mailing lists and here on Stack Overflow, a reproducible example is often asked for and always helpful.

What are your tips for creating an excellent example? How do you paste data structures from R in a text format? What other information should you include?

Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc.?

How does one make a great R reproducible example?

蓝天 2024-11-13 06:28:20

Basically, a minimal reproducible example (MRE) should enable others to exactly reproduce your issue on their machines.

Please do not post images of your data, code, or console output!

Brief summary

An MRE consists of the following items:

  • a minimal dataset, necessary to demonstrate the problem
  • the minimal runnable code necessary to reproduce the issue, which can be run on the given dataset
  • all necessary information on the packages used, the R version, and the OS it runs on, perhaps via sessionInfo()
  • in the case of random processes, a seed (set by set.seed()) to enable others to replicate exactly the same results as you have

For examples of good MREs, see the "Examples" section at the bottom of the help page of the function you are using. Simply type e.g. help(mean), or ?mean for short, into your R console.
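Putting the four items together, a post might contain something like the following sketch. The data, the call to aggregate(), and the placeholder comment about the expected result are all made up for illustration.

```r
# Packages: only base R is used here; otherwise list them with library()

# Seed: makes the random draws below repeatable
set.seed(42)

# Minimal dataset: small, but with the same structure as the real data
dat <- data.frame(group = rep(c("a", "b"), each = 3),
                  value = rnorm(6))

# Minimal code: just the lines that show the problem
res <- aggregate(value ~ group, data = dat, FUN = mean)
res
# (describe here what you expected to see instead)

# Environment: R version and platform
R.version.string
```

Everything a reader needs is in one block that can be pasted into a fresh session.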

Providing a minimal dataset

Usually, sharing huge data sets is not necessary and may rather discourage others from reading your question. Therefore, it is better to use built-in datasets or create a small "toy" example that resembles your original data, which is actually what is meant by minimal. If for some reason you really need to share your original data, you should use a method, such as dput(), that allows others to get an exact copy of your data.

Built-in datasets

You can use one of the built-in datasets. A comprehensive list of built-in datasets can be seen with data(). There is a short description of every data set, and more information can be obtained, e.g. with ?iris, for the 'iris' data set that comes with R. Installed packages might contain additional datasets.
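As a small sketch of how one might browse and inspect the built-in data sets from code rather than from the pager that data() opens (this assumes the standard datasets package, which ships with R):

```r
# List the data sets shipped with the 'datasets' package
ds <- data(package = "datasets")$results
head(ds[, c("Item", "Title")])

# Inspect one of them before using it in an example
str(iris)
```

The str() output shows column classes at a glance, which helps in picking a data set that resembles your own data.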

Creating example data sets

Preliminary note: Sometimes you may need special formats (i.e. classes), such as factors, dates, or time series. For these, make use of functions like: as.factor, as.Date, as.xts, ... Example:

d <- as.Date("2020-12-30")

where

class(d)
# [1] "Date"
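The same idea applies to other classes. A brief sketch using only base R (the values are arbitrary):

```r
# A factor with an explicit level order
f <- factor(c("low", "high", "low"), levels = c("low", "high"))
class(f)
levels(f)

# A short sequence of dates
dts <- as.Date("2020-12-30") + 0:2
class(dts)
```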

Vectors

x <- rnorm(10)  ## random vector normal distributed
x <- runif(10)  ## random vector uniformly distributed    
x <- sample(1:100, 10)  ## 10 random draws out of 1, 2, ..., 100    
x <- sample(LETTERS, 10)  ## 10 random draws out of built-in latin alphabet

Matrices

m <- matrix(1:12, 3, 4, dimnames=list(LETTERS[1:3], LETTERS[1:4]))
m
#   A B C  D
# A 1 4 7 10
# B 2 5 8 11
# C 3 6 9 12

Data frames

set.seed(42)  ## for sake of reproducibility
n <- 6
dat <- data.frame(id=1:n, 
                  date=seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"),
                  group=rep(LETTERS[1:2], n/2),
                  age=sample(18:30, n, replace=TRUE),
                  type=factor(paste("type", 1:n)),
                  x=rnorm(n))
dat
#   id       date group age   type         x
# 1  1 2020-12-26     A  27 type 1 0.0356312
# 2  2 2020-12-27     B  19 type 2 1.3149588
# 3  3 2020-12-28     A  20 type 3 0.9781675
# 4  4 2020-12-29     B  26 type 4 0.8817912
# 5  5 2020-12-30     A  26 type 5 0.4822047
# 6  6 2020-12-31     B  28 type 6 0.9657529

Note: Although it is widely used, it is better not to name your data frame df, because df() is an R function for the density (i.e. height of the curve at point x) of the F distribution, and your object may clash with it.
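To see why this matters, consider what happens in a fresh session where no data frame called df has been created. This is a sketch; the exact wording of the error message depends on your R version and locale.

```r
# In a fresh session, df is a function from the stats package
is.function(df)

# If the data were left out of the question, code like df$x fails with
# an error about a closure instead of a clear "object not found"
msg <- tryCatch(df$x, error = function(e) conditionMessage(e))
msg
```

With a name such as dat, the same mistake produces an "object not found" error, which is much easier to diagnose.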

Copying original data

If you have a specific reason, or data that would be too difficult to construct an example from, you could provide a small subset of your original data, best by using dput.

Why use dput()?

dput() prints all the information needed to exactly reproduce your data to the console. You may simply copy the output and paste it into your question.

Calling dat (from above) produces output that still lacks information about variable classes and other features if you share it in your question. Furthermore, the spaces in the type column make it difficult to do anything with it. Even when we set out to use the data, we won't manage to get important features of your data right.

  id       date group age   type         x
1  1 2020-12-26     A  27 type 1 0.0356312
2  2 2020-12-27     B  19 type 2 1.3149588
3  3 2020-12-28     A  20 type 3 0.9781675
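By contrast, text produced by dput() can be turned back into an equivalent object. The following sketch simulates the copy-and-paste step with capture.output() and parse(); the small data frame is made up for illustration.

```r
dat <- data.frame(id   = 1:3,
                  date = as.Date("2020-12-26") + 0:2,
                  type = factor(paste("type", 1:3)))

# Capture what dput() prints, as if it had been pasted into a question
txt <- paste(capture.output(dput(dat)), collapse = "\n")

# A reader runs the pasted text and gets the object back
dat2 <- eval(parse(text = txt))

all.equal(dat, dat2)
sapply(dat2, class)
```

The Date and factor classes survive the round trip, which is exactly what a printed table cannot guarantee.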

Subset your data

To share a subset, use head(), subset(), or indexing such as iris[1:4, ]. Then wrap it in dput() to give others something that can be put into R immediately. Example:

dput(iris[1:4, ]) # first four rows of the iris data set

Console output to share in your question:

structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5, 
3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2, 
0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), .Label = c("setosa", 
"versicolor", "virginica"), class = "factor")), row.names = c(NA, 
4L), class = "data.frame")

When using dput, you may also want to include only relevant columns, e.g. dput(mtcars[1:3, c(2, 5, 6)])

Note: If your data frame has a factor with many levels, the dput output can be unwieldy because it will still list all the possible factor levels even if they aren't present in the subset of your data. To solve this issue, you can use the droplevels() function, e.g. dput(droplevels(iris[1:4, ])); Species then becomes a factor with only one level. One other caveat for dput is that it will not work for keyed data.table objects or for grouped tbl_df (class grouped_df) from the tidyverse. In these cases you can convert back to a regular data frame before sharing: dput(as.data.frame(my_data)).
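A quick way to check the effect of droplevels() before calling dput(), sketched with the built-in iris data:

```r
sub <- iris[1:4, ]

# The subset still carries all three species as levels
nlevels(sub$Species)

# After droplevels() only the level that actually occurs remains
sub_clean <- droplevels(sub)
nlevels(sub_clean$Species)
levels(sub_clean$Species)

dput(sub_clean)
```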

Consider using the constructive package for cleaner results

Using constructive::construct(iris[1:4, ]) instead of dput(iris[1:4, ]) as above gives the following output, which is more compact and easier to read. Data with long runs of repeated factor values make an even stronger case for construct().

data.frame(
  Sepal.Length = c(5.1, 4.9, 4.7, 4.6),
  Sepal.Width = c(3.5, 3, 3.2, 3.1),
  Petal.Length = c(1.4, 1.4, 1.3, 1.5),
  Petal.Width = rep(0.2, 4L),
  Species = factor(rep("setosa", 4L), levels = c("setosa", "versicolor", "virginica"))
)

Producing minimal code

Combined with the minimal data (see above), your code should exactly reproduce the problem on another machine by simply copying and pasting it.

This should be the easy part but often isn't. What you should not do:

  • show all kinds of data conversions; make sure the provided data is already in the correct format (unless that is the problem, of course)
  • copy-paste a whole script that gives an error somewhere. Try to locate exactly which lines result in the error. More often than not, you'll find out what the problem is yourself.

What you should do:

  • add which packages you use if you use any (using library())
  • test run your code in a fresh R session to ensure the code is runnable. People should be able to copy-paste your data and your code in the console and get the same as you have.
  • if you open connections or create files, add some code to close them or delete the files (using unlink())
  • if you change options, make sure the code contains a statement to revert them to the original ones (e.g. op <- par(mfrow=c(1,2)) ...some code... par(op))
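The last two points can be sketched as follows. The temporary file and the digits option are chosen only for illustration; the same pattern applies to par() and to connections.

```r
# Create a file in the temporary directory, use it, then remove it
tmp <- tempfile(fileext = ".csv")
write.csv(head(mtcars), tmp)
x <- read.csv(tmp)
unlink(tmp)

# Change an option, run some code, then restore the previous value
old_digits <- getOption("digits")
op <- options(digits = 3)
print(pi)
options(op)
```

options() returns the old values invisibly, so passing them back restores the reader's session to its earlier state.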

Providing necessary information

In most cases, just the R version and the operating system will suffice. When conflicts arise with packages, giving the output of sessionInfo() can really help. When talking about connections to other applications (be it through ODBC or anything else), one should also provide version numbers for those, and if possible, also the necessary information on the setup.

If you are running R in RStudio, using rstudioapi::versionInfo() can help report your RStudio version.

If you have a problem with a specific package, you may want to provide the package version by giving the output of packageVersion("name of the package").
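A short sketch of how this information could be collected from within R. The stats package is used only because it is always installed; substitute the package you are asking about.

```r
R.version.string                        # R version
Sys.info()[["sysname"]]                 # operating system
as.character(packageVersion("stats"))   # version of one package

si <- sessionInfo()                     # everything at once
si
```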

Seed

Using set.seed() you can specify a seed (see note 1 below), i.e. a fixed state for R's random number generator. This makes random functions such as sample(), rnorm(), runif() and many others return the same result every time. Example:

set.seed(42)
rnorm(3)
# [1]  1.3709584 -0.5646982  0.3631284

set.seed(42)
rnorm(3)
# [1]  1.3709584 -0.5646982  0.3631284

Note 1: Results obtained after set.seed() can differ between R 3.6.0 or later and earlier versions, because the default sampling method used by sample() changed in R 3.6.0. State which R version you used for the random process, and don't be surprised if you get slightly different results when following old questions. To get the same result in such cases, call RNGversion() before set.seed() (e.g. RNGversion("3.5.2")).
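The following sketch shows how RNGversion() might be used to compare the two behaviours. It assumes R 3.6.0 or later; the draws themselves are not shown because they depend on the settings in effect.

```r
# Reproduce sampling from a question written for R < 3.6.0
suppressWarnings(RNGversion("3.5.2"))  # R warns about the old sampler
set.seed(42)
old_draw <- sample(1:100, 3)

# Switch back to the defaults of the running R version
RNGversion(as.character(getRversion()))
set.seed(42)
new_draw <- sample(1:100, 3)

old_draw
new_draw
```

Remember to switch back afterwards, otherwise the rest of the session keeps using the old sampler.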

月寒剑心 2024-11-13 06:28:20

(Here's my advice from How to write a reproducible example. I've tried to make it short but sweet. Section 9.2 of "Workflow: Getting help" in r4ds is a more recent take that also discusses the reprex package.)

How to write a reproducible example

You are most likely to get good help with your R problem if you provide a reproducible example. A reproducible example allows someone else to recreate your problem by just copying and pasting R code.

You need to include four things to make your example reproducible: required packages, data, code, and a description of your R environment.

  • Packages should be loaded at the top of the script, so it's easy to see which ones the example needs.

  • The easiest way to include data in an email or Stack Overflow question is to use dput() to generate the R code to recreate it. For example, to recreate the mtcars dataset in R, I'd perform the following steps:

    1. Run dput(mtcars) in R
    2. Copy the output
    3. In my reproducible script, type mtcars <- then paste.

  • Spend a little bit of time ensuring that your code is easy for others to read:

    • Make sure you've used spaces and your variable names are concise, but informative

    • Use comments to indicate where your problem lies

    • Do your best to remove everything that is not related to the problem. The shorter your code is, the easier it is to understand.

  • Include the output of sessionInfo() in a comment in your code. This summarises your R environment and makes it easy to check if you're using an out-of-date package.

You can check you have actually made a reproducible example by starting up a fresh R session and pasting your script in.
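One way to automate that check is to run the script in a separate, clean R process. This is a sketch under a few assumptions: Rscript sits in R.home("bin"), the temporary path contains no spaces, and the two-line script stands in for your own example.

```r
# Write the example to a temporary script file
script <- tempfile(fileext = ".R")
writeLines(c("x <- c(1, 2, NA)",
             "mean(x, na.rm = TRUE)"),
           script)

# Run it in a fresh process that ignores the workspace and profile files
rscript <- file.path(R.home("bin"), "Rscript")
status <- system2(rscript, c("--vanilla", script))

status   # 0 means the script ran without error
unlink(script)
```

A non-zero status usually means the script depends on something that exists only in your current session, such as an object or a loaded package.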

Before putting all of your code in an email, consider putting it in a GitHub Gist. It will give your code nice syntax highlighting, and you don't have to worry about anything getting mangled by the email system.

不必你懂 2024-11-13 06:28:20

Personally, I prefer "one" liners. Something along the lines of:

my.df <- data.frame(col1 = sample(c(1,2), 10, replace = TRUE),
        col2 = as.factor(sample(10)), col3 = letters[1:10],
        col4 = sample(c(TRUE, FALSE), 10, replace = TRUE))
my.list <- list(list1 = my.df, list2 = my.df[3], list3 = letters)

The data structure should mimic the idea of the writer's problem, not its exact verbatim structure. I really appreciate it when variables don't overwrite my own variables or, god forbid, functions (like df).

Alternatively, one could cut a few corners and point to a pre-existing data set, something like:

library(vegan)
data(varespec)
ord <- metaMDS(varespec)

Don't forget to mention any special packages you might be using.

If you're trying to demonstrate something on larger objects, you can try

my.df2 <- data.frame(a = sample(10e6), b = sample(letters, 10e6, replace = TRUE))

If you're working with spatial data via the raster package, you can generate some random data. A lot of examples can be found in the package vignette, but here's a small nugget.

library(raster)
r1 <- r2 <- r3 <- raster(nrow=10, ncol=10)
values(r1) <- runif(ncell(r1))
values(r2) <- runif(ncell(r2))
values(r3) <- runif(ncell(r3))
s <- stack(r1, r2, r3)

If you need some spatial object as implemented in sp, you can get some datasets via external files (like ESRI shapefile) in "spatial" packages (see the Spatial view in Task Views).

library(rgdal)
ogrDrivers()
dsn <- system.file("vectors", package = "rgdal")[1]
ogrListLayers(dsn)
ogrInfo(dsn=dsn, layer="cities")
cities <- readOGR(dsn=dsn, layer="cities")
熊抱啵儿 2024-11-13 06:28:20

Inspired by this very post, I now use a handy function, reproduce(<mydata>), when I need to post to Stack Overflow.


Quick instructions

If myData is the name of your object to reproduce, run the following in R:

install.packages("devtools")
library(devtools)
source_url("https://raw.github.com/rsaporta/pubR/gitbranch/reproduce.R")

reproduce(myData)

Details:

This function is an intelligent wrapper to dput and does the following:

  • Automatically samples a large data set (based on size and class. Sample size can be adjusted)
  • Creates a dput output
  • Allows you to specify which columns to export
  • Prepends objName <- ... to the output, so that it can be easily copy+pasted, but...
  • If working on a Mac, the output is automagically copied to the clipboard, so that you can simply run it and then paste it to your question.
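To illustrate the general idea, here is a minimal sketch of a dput wrapper. It is not the reproduce() function described above; dput_sample() and its arguments are hypothetical, and it covers only row sampling from the top and bottom, column selection, and the assignment prefix.

```r
# Hypothetical helper, NOT the reproduce() function from the answer
dput_sample <- function(x, n = 10, cols = names(x)) {
  idx   <- seq_len(nrow(x))
  n_top <- ceiling(0.7 * n)              # about 70 % from the top rows
  n_bot <- n - n_top                     # the rest from the bottom rows
  rows  <- unique(c(head(idx, n_top), tail(idx, n_bot)))
  out   <- droplevels(x[rows, cols, drop = FALSE])
  cat(deparse(substitute(x)), "<- ")     # prefix with "name <- "
  dput(out)
  invisible(out)
}

res <- dput_sample(iris, n = 10, cols = c("Sepal.Length", "Species"))
```

The printed text can be pasted into a question as is; the returned object is the sampled data frame.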

The source is available here:


Example:

# sample data
DF <- data.frame(id=rep(LETTERS, each=4)[1:100], replicate(100, sample(1001, 100)), Class=sample(c("Yes", "No"), 100, TRUE))

DF is about 100 x 102. I want to sample 10 rows and a few specific columns:

reproduce(DF, cols=c("id", "X1", "X73", "Class"))  # I could also specify the column number.

Gives the following output:

This is what the sample looks like:

    id  X1 X73 Class
1    A 266 960   Yes
2    A 373 315    No            Notice the selection split
3    A 573 208    No           (which can be turned off)
4    A 907 850   Yes
5    B 202  46   Yes
6    B 895 969   Yes   <~~~ 70 % of selection is from the top rows
7    B 940 928    No
98   Y 371 171   Yes
99   Y 733 364   Yes   <~~~ 30 % of selection is from the bottom rows.
100  Y 546 641    No


    ==X==============================================================X==
         Copy+Paste this part. (If on a Mac, it is already copied!)
    ==X==============================================================X==

 DF <- structure(list(id = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 25L, 25L, 25L), .Label = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y"), class = "factor"), X1 = c(266L, 373L, 573L, 907L, 202L, 895L, 940L, 371L, 733L, 546L), X73 = c(960L, 315L, 208L, 850L, 46L, 969L, 928L, 171L, 364L, 641L), Class = structure(c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L), .Label = c("No", "Yes"), class = "factor")), .Names = c("id", "X1", "X73", "Class"), class = "data.frame", row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 98L, 99L, 100L))

    ==X==============================================================X==

Notice also that the entire output is on one nice long line, not a tall paragraph of chopped-up lines. This makes it easier to read in Stack Overflow question posts and also easier to copy+paste.


Update Oct 2013:

You can now specify how many lines the text output will take up (i.e., what you will paste into Stack Overflow). Use the lines.out=n argument for this. Example:

reproduce(DF, cols=c(1:3, 17, 23), lines.out=7) yields:

    ==X==============================================================X==
         Copy+Paste this part. (If on a Mac, it is already copied!)
    ==X==============================================================X==

 DF <- structure(list(id = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 25L,25L, 25L), .Label
      = c("A", "B", "C", "D", "E", "F", "G", "H","I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U","V", "W", "X", "Y"), class = "factor"),
      X1 = c(809L, 81L, 862L,747L, 224L, 721L, 310L, 53L, 853L, 642L),
      X2 = c(926L, 409L,825L, 702L, 803L, 63L, 319L, 941L, 598L, 830L),
      X16 = c(447L,164L, 8L, 775L, 471L, 196L, 30L, 420L, 47L, 327L),
      X22 = c(335L,164L, 503L, 407L, 662L, 139L, 111L, 721L, 340L, 178L)), .Names = c("id","X1",
      "X2", "X16", "X22"), class = "data.frame", row.names = c(1L,2L, 3L, 4L, 5L, 6L, 7L, 98L, 99L, 100L))

    ==X==============================================================X==
甩你一脸翔 2024-11-13 06:28:20

Here is a good guide.

The most important point is: Make a small piece of code that we can run to see what the problem is. A useful function for this is dput(), but if you have very large data, then you might want to make a small sample dataset or only use the first 10 lines or so.

EDIT:

Also, make sure that you identified where the problem is yourself. The example should not be an entire R script with "On line 200 there is an error". If you use the debugging tools in R (I love browser()) and Google, then you should be able to really identify where the problem is and reproduce a trivial example in which the same thing goes wrong.

枯寂 2024-11-13 06:28:20

The R-help mailing list has a posting guide which covers both asking and answering questions, including an example of generating data:

Examples: Sometimes it helps to
provide a small example that someone
can actually run. For example:

If I have a matrix x as follows:

  > x <- matrix(1:8, nrow=4, ncol=2,
                dimnames=list(c("A","B","C","D"), c("x","y")))
  > x
    x y
  A 1 5
  B 2 6
  C 3 7
  D 4 8
  >

how can I turn it into a dataframe
with 8 rows, and three columns named
'row', 'col', and 'value', which have
the dimension names as the values of 'row' and 'col', like this:

  > x.df
     row col value
  1    A   x      1

...
(To which the answer might be:

  > x.df <- reshape(data.frame(row=rownames(x), x), direction="long",
                    varying=list(colnames(x)), times=colnames(x),
                    v.names="value", timevar="col", idvar="row")

)

The word small is especially important. You should be aiming for a minimal reproducible example, which means that the data and the code should be as simple as possible to explain the problem.

EDIT: Pretty code is easier to read than ugly code. Use a style guide.

始终不够 2024-11-13 06:28:20

Since R 2.14 (I guess) you can feed a text representation of your data directly to read.table:

 df <- read.table(header=TRUE, 
  text="Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
") 
飘落散花 2024-11-13 06:28:20

Sometimes the problem really isn't reproducible with a smaller piece of data, no matter how hard you try, and doesn't happen with synthetic data (although it's useful to show how you produced synthetic data sets that did not reproduce the problem, because it rules out some hypotheses).

  • Posting the data to the web somewhere and providing a URL may be necessary.
  • If the data can't be released to the public at large but could be shared at all, then you may be able to offer to e-mail it to interested parties (although this will cut down the number of people who will bother to work on it).
  • I haven't actually seen this done, because people who can't release their data are sensitive about releasing it in any form, but it seems plausible that in some cases one could still post data if it were sufficiently anonymized, scrambled, or slightly corrupted in some way.
  • You can also try to provide code that simulates data.
    • If providing the covariates/predictors in your data is OK (from a privacy/IP perspective and logistically feasible) but you don't want to share the response variable, and you have a model that fits the data, you may be able to use the simulate() method to generate new sample data simulated from the estimated parameters.
    • For de novo simulation, data-generation packages such as the faux package (for factorial designs) or the wakefield package are useful. The simulate() (lme4) or simulate_new() (glmmTMB) functions can simulate responses given predictor variables and parameters, e.g.:
library(glmmTMB)  # provides simulate_new()
set.seed(101)
## simulate covariates/predictors/experimental design
dd <- expand.grid(f1 = factor(1:3), f2 = factor(LETTERS[1:2]),
                  rep = 1:10)
dd$x <- rnorm(nrow(dd))
ss <- simulate_new( ~ f1*f2*x,
       newdata = dd,
       newparams = list(beta = rnorm(12)),
       family = poisson)
dd$y <- ss[[1]]

The glmmTMB package also has a vignette that describes the simulation process in more detail.
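
As a rough sketch of the simulate() route mentioned in the list above, the following uses only base R and the built-in mtcars data. The model formula and the split into "shareable" predictors and a "private" response are assumptions made for illustration; they are not part of the original answer.

```r
# Assumption: mpg stands in for a response you cannot share,
# while wt and hp are predictors that are fine to publish.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# simulate() draws a new response from the fitted model; for an lm fit
# it returns a data frame with one column per simulation (sim_1, ...).
sim <- simulate(fit, nsim = 1, seed = 101)

# Combine the real predictors with the simulated response.
shareable <- mtcars[, c("wt", "hp")]
shareable$mpg <- sim$sim_1

head(shareable)
```

The simulated response keeps the fitted relationship between the variables without exposing the observed values, which is often enough for someone else to reproduce a modelling problem.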

If you can't do any of these then you probably need to hire a consultant to solve your problem ...

edit: Two useful SO questions for anonymization/scrambling:

谁对谁错谁最难过 2024-11-13 06:28:20

The answers so far are obviously great for the reproducibility part. This is merely to clarify that a reproducible example cannot and should not be the sole component of a question. Don't forget to explain what you want it to look like and the contours of your problem, not just how you have attempted to get there so far. Code is not enough; you need words also.

Here's a reproducible example of what to avoid doing (drawn from a real example, names changed to protect the innocent):


The following is sample data and part of function I have trouble with.

code
code
code
code
code (40 or so lines of it)

How can I achieve this ?


只涨不跌 2024-11-13 06:28:20

I have a very easy and efficient way to make an R example that has not been mentioned above.
You can first define your structure. For example,

mydata <- data.frame(a=character(0), b=numeric(0),  c=numeric(0), d=numeric(0))

fix(mydata)

When you execute the fix command, a pop-up data editor box appears.

Then you can input your data manually. This is efficient for smaller examples rather than big ones.

栩栩如生 2024-11-13 06:28:20

Guidelines:


Your main objective in crafting your questions should be to make it as easy as possible for readers to understand and reproduce your problem on their systems. To do so:

  1. Provide input data
  2. Provide expected output
  3. Explain your problem succinctly
    • if you have over 20 lines of text + code, you can probably go back and simplify
    • simplify your code as much as possible while preserving the problem/error

This does take some work, but it seems like a fair trade-off since you ask others to do work for you.

Providing Data:


Built-in Data Sets

The best option by far is to rely on built-in datasets. This makes it very easy for others to work on your problem. Type data() at the R prompt to see what data is available to you. Some classic examples:

  • iris
  • mtcars
  • ggplot2::diamonds (external package, but almost everyone has it)

Inspect the built-in datasets to find one suitable for your problem.

If you can rephrase your problem to use the built-in datasets, you are much more likely to get good answers (and upvotes).

Self Generated Data

If your problem is specific to a type of data that is not represented in the existing data sets, then provide the R code that generates the smallest possible dataset on which your problem manifests itself. For example:

set.seed(1)  # important to make random data reproducible
myData <- data.frame(a = sample(letters[1:5], 20, replace = TRUE), b = runif(20))

Someone trying to answer my question can copy/paste those two lines and start working on the problem immediately.

dput

As a last resort, you can use dput to transform a data object to R code (e.g. dput(myData)). I say as a "last resort" because the output of dput is often fairly unwieldy, annoying to copy-paste, and obscures the rest of your question.
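
To make the dput round trip concrete, here is a minimal sketch in base R. The six-row data frame is a made-up stand-in, and capturing the output as text is done only so that both sides of the exchange fit in one script; in practice the asker pastes the printed text into the question and the reader pastes it into their session.

```r
set.seed(1)
myData <- data.frame(a = sample(letters[1:5], 6, replace = TRUE), b = runif(6))

# dput() prints R code that rebuilds the object; capture it as text here.
code <- paste(capture.output(dput(myData)), collapse = "\n")
cat(code, "\n")

# What a reader does after pasting that text into their own session:
restored <- eval(parse(text = code))

# all.equal() rather than identical(): by default dput() writes doubles
# with about 15 significant digits, so tiny rounding differences are possible.
stopifnot(isTRUE(all.equal(restored, myData)))
```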

Provide Expected Output:


Someone once said:

A picture of expected output is worth 1000 words

-- a sage person

If you can add something like "I expected to get this result":

   cyl   mean.hp
1:   6 122.28571
2:   4  82.63636
3:   8 209.21429

to your question, people are much more likely to understand what you are trying to do quickly. If your expected result is large and unwieldy, then you probably haven't thought enough about how to simplify your problem (see next).
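
For illustration, a table like the one above can be produced from the built-in mtcars data with base R. This is only one possible way to obtain it; the original table looks like data.table output, so the row order and print format differ.

```r
# Mean horsepower per cylinder count in the built-in mtcars data.
expected <- aggregate(hp ~ cyl, data = mtcars, FUN = mean)
names(expected)[2] <- "mean.hp"

# aggregate() sorts the groups, so rows come out as cyl 4, 6, 8
# rather than in order of first appearance.
expected
```

Including the code that builds the expected output, not just the printed table, lets readers compare their result programmatically.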

Explain Your Problem Succinctly


The main thing to do is simplify your problem as much as possible before you ask your question. Re-framing the problem to work with the built-in datasets will help a lot in this regard. You will also often find that just by going through the process of simplification, you will answer your own problem.

Here are some examples of good questions:

In both cases, the user's problems are almost certainly not with the simple examples they provide. Rather they abstracted the nature of their problem and applied it to a simple data set to ask their question.

Why Yet Another Answer To This Question?


This answer focuses on what I think is the best practice: use built-in data sets and provide what you expect as a result in a minimal form. The most prominent answers focus on other aspects. I don't expect this answer to rise to any prominence; it is here solely so that I can link to it in comments on newbie questions.

烟酉 2024-11-13 06:28:20

To quickly create a dput of your data you can just copy (a piece of) the data to your clipboard and run the following in R:

For data in Excel:

dput(read.table("clipboard", sep="\t", header=TRUE))

For data in a .txt file:

dput(read.table("clipboard", sep="", header=TRUE))

You can change the sep in the latter if necessary.
This will only work if your data is in the clipboard of course.

煮茶煮酒煮时光 2024-11-13 06:28:20

Reproducible code is the key to getting help. However, many users may be sceptical of pasting even a chunk of their data. For instance, they could be working with sensitive data or with original data collected for a research paper.

For whatever reason, I thought it would be nice to have a handy function for "deforming" my data before pasting it publicly. The anonymize function from the SciencesPo package is very silly, but for me it works nicely with the dput function.

install.packages("SciencesPo")
library(SciencesPo)  # provides anonymize() and sample.df()

dt <- data.frame(
    Z = sample(LETTERS,10),
    X = sample(1:10),
    Y = sample(c("yes", "no"), 10, replace = TRUE)
)
dt
#    Z  X   Y
# 1  D  8  no
# 2  T  1 yes
# 3  J  7  no
# 4  K  6  no
# 5  U  2  no
# 6  A 10 yes
# 7  Y  5  no
# 8  M  9 yes
# 9  X  4 yes
# 10 Z  3  no

Then I anonymize it:

anonymize(dt)
#      Z    X  Y
# 1   b2  2.5 c1
# 2   b6 -4.5 c2
# 3   b3  1.5 c1
# 4   b4  0.5 c1
# 5   b7 -3.5 c1
# 6   b1  4.5 c2
# 7   b9 -0.5 c1
# 8   b5  3.5 c2
# 9   b8 -1.5 c2
# 10 b10 -2.5 c1

One may also want to sample a few variables instead of the whole data before applying the anonymization and dput command.

# Sample two variables without replacement
anonymize(sample.df(dt,5,vars=c("Y","X")))
#    Y    X
# 1 a1 -0.4
# 2 a1  0.6
# 3 a2 -2.4
# 4 a1 -1.4
# 5 a2  3.6
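
If the SciencesPo package is not available, a similar effect can be sketched in base R. The function below, anonymize_sketch, is a hypothetical helper written for this illustration and is not the SciencesPo implementation: it assumes that centering numeric columns and replacing every other column with arbitrary labels is enough for your purposes.

```r
# Hypothetical helper: centre numeric columns and replace all other
# columns with labels made of a letter plus a group code.
anonymize_sketch <- function(df) {
  out <- df
  for (j in seq_along(df)) {
    col <- df[[j]]
    if (is.numeric(col)) {
      out[[j]] <- col - mean(col, na.rm = TRUE)
    } else {
      codes <- as.integer(factor(col))
      out[[j]] <- paste0(letters[(j - 1) %% 26 + 1], codes)
    }
  }
  out
}

set.seed(42)
dt <- data.frame(
  Z = sample(LETTERS, 10),
  X = sample(1:10),
  Y = sample(c("yes", "no"), 10, replace = TRUE)
)

anon <- anonymize_sketch(dt)
anon
```

Note that the codes follow the sorted order of the original values, so this only disguises the data superficially; it is not a substitute for real anonymization of sensitive data.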
画▽骨i 2024-11-13 06:28:20

Often you need some data for an example, but you don't want to post your exact data. To use an existing data frame from an established package, load it with the data() command.

e.g.,

data(mtcars)

and then demonstrate the problem:

names(mtcars)
# your problem demonstrated on the mtcars data set
魔法少女 2024-11-13 06:28:20

If you have a large dataset that cannot easily be put into the script using dput(), post your data to Pastebin and load it using read.table:

d <- read.table("http://pastebin.com/raw.php?i=m1ZJuKLH")

Inspired by Henrik.

べ繥欢鉨o。 2024-11-13 06:28:20

I am developing the wakefield package to address this need to quickly share reproducible data. Sometimes dput works fine for smaller data sets, but many of the problems we deal with are much larger, and sharing such a large data set via dput is impractical.

About:

wakefield allows the user to share minimal code to reproduce data. The user sets n (number of rows) and specifies any number of preset variable functions (there are currently 70) that mimic real data (things like gender, age, income, etc.).

Installation:

Currently (2015-06-11), wakefield is a GitHub package but will go to CRAN eventually after unit tests are written. To install quickly, use:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/wakefield")

Example:

Here is an example:

r_data_frame(
    n = 500,
    id,
    race,
    age,
    sex,
    hour,
    iq,
    height,
    died
)

This produces:

    ID  Race Age    Sex     Hour  IQ Height  Died
1  001 White  33   Male 00:00:00 104     74  TRUE
2  002 White  24   Male 00:00:00  78     69 FALSE
3  003 Asian  34 Female 00:00:00 113     66  TRUE
4  004 White  22   Male 00:00:00 124     73  TRUE
5  005 White  25 Female 00:00:00  95     72  TRUE
6  006 White  26 Female 00:00:00 104     69  TRUE
7  007 Black  30 Female 00:00:00 111     71 FALSE
8  008 Black  29 Female 00:00:00 100     64  TRUE
9  009 Asian  25   Male 00:30:00 106     70 FALSE
10 010 White  27   Male 00:30:00 121     68 FALSE
.. ...   ... ...    ...      ... ...    ...   ...
心碎无痕… 2024-11-13 06:28:20

If you have one or more factor variable(s) in your data that you want to make reproducible with dput(head(mydata)), consider adding droplevels to it, so that levels of factors that are not present in the minimized data set are not included in your dput output, in order to make the example minimal:

dput(droplevels(head(mydata)))
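
A small sketch of the effect, using the built-in iris data in place of mydata (an assumption for illustration): the first six rows contain only one species, yet the factor still carries all three levels until droplevels is applied.

```r
small <- head(iris)

levels(small$Species)               # all three species are still listed
trimmed <- droplevels(small)
levels(trimmed$Species)             # only the level that actually occurs

# The dput() text is shorter once the unused levels are gone.
n_full    <- nchar(paste(capture.output(dput(small)), collapse = ""))
n_trimmed <- nchar(paste(capture.output(dput(trimmed)), collapse = ""))
n_trimmed < n_full
```

One caveat: if the problem itself depends on unused factor levels (for example, empty groups in a table or plot), dropping them would hide it, so keep them in that case.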
琴流音 2024-11-13 06:28:20

The original post referred to the now-retired r-fiddle service from DataCamp. It has been rebranded as DataCamp Light and cannot be embedded as easily as my answer indicated.

I wonder if an http://old.r-fiddle.org/ link could be a very neat way of sharing a problem. Each fiddle receives a unique ID, and one could even think about embedding it in SO.

迷爱 2024-11-13 06:28:20

Please do not paste your console outputs like this:

If I have a matrix x as follows:
> x <- matrix(1:8, nrow=4, ncol=2,
            dimnames=list(c("A","B","C","D"), c("x","y")))
> x
  x y
A 1 5
B 2 6
C 3 7
D 4 8
>

How can I turn it into a dataframe with 8 rows, and three
columns named `row`, `col`, and `value`, which have the
dimension names as the values of `row` and `col`, like this:
> x.df
    row col value
1    A   x      1
...
(To which the answer might be:
> x.df <- reshape(data.frame(row=rownames(x), x), direction="long",
+                varying=list(colnames(x)), times=colnames(x),
+                v.names="value", timevar="col", idvar="row")
)

We cannot copy-paste it directly.

To make questions and answers properly reproducible, remove the + and > prompts before posting, and prefix outputs and comments with #, like this:

#If I have a matrix x as follows:
x <- matrix(1:8, nrow=4, ncol=2,
            dimnames=list(c("A","B","C","D"), c("x","y")))
x
#  x y
#A 1 5
#B 2 6
#C 3 7
#D 4 8

# How can I turn it into a dataframe with 8 rows, and three
# columns named `row`, `col`, and `value`, which have the
# dimension names as the values of `row` and `col`, like this:

#x.df
#    row col value
#1    A   x      1
#...
#To which the answer might be:

x.df <- reshape(data.frame(row=rownames(x), x), direction="long",
                varying=list(colnames(x)), times=colnames(x),
                v.names="value", timevar="col", idvar="row")

One more thing: if you have used any function from a particular package, mention that package (for example, with a library() call).

冷月断魂刀 2024-11-13 06:28:20

You can do this using reprex.

As mt1022 noted, "... good package for producing minimal, reproducible example is "reprex" from tidyverse".

According to Tidyverse:

The goal of "reprex" is to package your problematic code in such a way that other people can run it and feel your pain.

An example is given on the tidyverse website.

library(reprex)

# Pass the code as an expression (or copy it to the clipboard and call reprex()):
reprex({
  y <- 1:4
  mean(y)
})

I think this is the simplest way to create a reproducible example.

薯片软お妹 2024-11-13 06:28:20

Apart from all the above answers, which I found very interesting, making an example can sometimes be very easy, as discussed here: How to make a minimal reproducible example to get help with R

There are many ways to make a random vector (see: Create a 100 number vector with random values in R rounded to 2 decimals) or a random matrix in R:

mydf1 <- matrix(rnorm(100), nrow = 20, ncol = 5)

Note that it is sometimes very difficult to share given data for various reasons, such as its dimensions. The answers above are all great, and they are important to keep in mind when making a reproducible data example. To make the example data as representative of the original as possible (in case the OP cannot share the original data), it is good to add some information about the original alongside the example (here the data is called mydf1):

class(mydf1)
# this shows the type of the data you have
dim(mydf1)
# this shows the dimension of your data

Moreover, one should report the type, length and attributes of the data (see Data structures):

typeof(mydf1)      # what it is
length(mydf1)      # how many elements it contains
attributes(mydf1)  # additional arbitrary metadata

# If you cannot share your original data, you can str() it to give an idea of its structure
str(mydf1)
婴鹅 2024-11-13 06:28:20

Here are some of my suggestions:

  • Try to use default R datasets
  • If you have your own dataset, include it with dput, so others can help you more easily
  • Do not use install.packages() unless it is really necessary; people will understand if you just use require or library
  • Try to be concise:

    • Have some dataset
    • Try to describe the output you need as simply as possible
    • Do it yourself before you ask the question
  • It is easy to upload an image, so upload plots if you have any
  • Also include any errors you may have

All these are part of a reproducible example.

请别遗忘我 2024-11-13 06:28:20

It's a good idea to use functions from the testthat package to show what you expect to occur. Thus, other people can alter your code until it runs without error. This eases the burden of those who would like to help you, because it means they don't have to decode your textual description. For example

library(testthat)
# code defining x and y
if (y >= 10) {
    expect_equal(x, 1.23)
} else {
    expect_equal(x, 3.21)
}

is clearer than "I think x would come out to be 1.23 for y equal to or exceeding 10, and 3.21 otherwise, but I got neither result". Even in this silly example, I think the code is clearer than the words. Using testthat lets your helper focus on the code, which saves time, and it gives them a way to know they have solved your problem before they post an answer.
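
If you would rather not make readers install testthat, the same expectation can be sketched with base R. The definitions of x and y below are hypothetical stand-ins for "code defining x and y"; stopifnot() and all.equal() are swapped in for expect_equal() and are not what the answer above uses.

```r
# Hypothetical stand-in for "code defining x and y"
y <- 12
x <- if (y >= 10) 1.23 else 3.21

# stopifnot() raises an error unless every condition is TRUE;
# all.equal() compares numbers with a small tolerance, much like expect_equal().
if (y >= 10) {
    stopifnot(isTRUE(all.equal(x, 1.23)))
} else {
    stopifnot(isTRUE(all.equal(x, 3.21)))
}
```

The trade-off is that the base R version gives terser failure messages than testthat, but it runs in any R session without extra packages.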
