How to make a great R reproducible example

When discussing performance with colleagues, teaching, sending a bug report or searching for guidance on mailing lists and here on Stack Overflow, a reproducible example is often asked for and always helpful.

What are your tips for creating an excellent example? How do you paste data structures from R in a text format? What other information should you include?

Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc.?

How does one make a great R reproducible example?

Reposted from: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

游.程 2011-05-12 13:21
A minimal reproducible example consists of the following items:

a minimal dataset, necessary to reproduce the error

the minimal runnable code necessary to reproduce the error, which can be run on the given dataset

the necessary information on the packages used, the R version, and the system it is run on

in the case of random processes, a seed (set by set.seed()) for reproducibility

Looking at the examples in the help files of the functions you use is often helpful. In general, all the code given there fulfills the requirements of a minimal reproducible example: data is provided, minimal code is provided, and everything is runnable.
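As a rough sketch of how those four ingredients might fit together in one post, assuming a made-up data frame dat and a linear model as a stand-in for the code that actually misbehaves (both are placeholders, not part of the original answer):

```r
# Skeleton of a minimal reproducible example.
# 'dat' and the lm() call are illustrative placeholders.

set.seed(42)  # seed, because rnorm() below is random

# 1. Minimal dataset: ten rows are enough to show the behaviour
dat <- data.frame(
  x = 1:10,
  y = 2 * (1:10) + rnorm(10)
)

# 2. Minimal code that triggers the behaviour being asked about
fit <- lm(y ~ x, data = dat)
print(coef(fit))

# 3. Environment information
cat(R.version.string, "\n")
```

Everything a reader needs is in the block, so it can be pasted into a fresh session as is.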

Producing a minimal dataset

For most cases, this can be easily done by just providing a vector/data frame with some values. Or you can use one of the built-in datasets, which are provided with most packages.
A comprehensive list of built-in datasets can be seen with library(help = "datasets"). There is a short description of every dataset, and more information can be obtained with, for example, ?mtcars, where 'mtcars' is one of the datasets in the list. Other packages might contain additional datasets.
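For a scriptable alternative to browsing the help page, something like the following could be used. It relies on data(package = ...), which returns an object whose results component is a character matrix; the variable name info is just illustrative:

```r
# List the datasets shipped in the 'datasets' package
info <- data(package = "datasets")$results   # character matrix
print(head(info[, c("Item", "Title")]))

# Inspect one built-in dataset before using it in an example
str(mtcars)
```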

Making a vector is easy. Sometimes it is necessary to add some randomness to it, and there are a number of functions for that. sample() can randomize a vector, or give a random vector with only a few values. letters is a useful vector containing the alphabet. This can be used for making factors.

A few examples :

random values : x <- rnorm(10) for normal distribution, x <- runif(10) for uniform distribution, ...

a permutation of some values : x <- sample(1:10) for vector 1:10 in random order.

a random factor : x <- sample(letters[1:4], 20, replace = TRUE)
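Put together with a seed, the one-liners above might look like the sketch below. The variable names are illustrative, and the call to factor() is added here because sample() on letters returns a character vector rather than a factor:

```r
set.seed(123)  # make the random draws repeatable for readers

x_norm <- rnorm(10)        # 10 draws from a standard normal
x_unif <- runif(10)        # 10 draws from a uniform on [0, 1]
x_perm <- sample(1:10)     # the values 1:10 in random order

# sample() gives a character vector; wrap it in factor() for a factor
x_fac <- factor(sample(letters[1:4], 20, replace = TRUE))

str(x_fac)
```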

For matrices, one can use matrix(), eg :

matrix(1:10, ncol = 2)

Making data frames can be done using data.frame(). Take care to name the entries in the data frame, and do not make it overly complicated.

An example :

set.seed(1)
Data <- data.frame(
    X = sample(1:10),
    Y = sample(c("yes", "no"), 10, replace = TRUE)
)

For some questions, specific formats can be needed. For these, one can use any of the provided as.someType functions: as.factor, as.Date, as.xts, ... Use these in combination with the vector and/or data frame tricks.
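A small sketch of that combination, using only base R conversions (as.xts comes from the xts package and is left out here). The column names day and group are made up for illustration:

```r
# Start from plain character columns, then convert to the types
# the question is actually about.
dat <- data.frame(
  day   = c("2011-05-10", "2011-05-11", "2011-05-12"),
  group = c("a", "b", "a"),
  stringsAsFactors = FALSE
)

dat$day   <- as.Date(dat$day)       # character -> Date
dat$group <- as.factor(dat$group)   # character -> factor

str(dat)
```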

Copy your data

If you have some data that would be too difficult to construct using these tips, then you can always make a subset of your original data, using e.g. head(), subset() or the indices. Then use e.g. dput() to give us something that can be put in R immediately:

> dput(head(iris,4))
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6),
    Sepal.Width = c(3.5, 3, 3.2, 3.1),
    Petal.Length = c(1.4, 1.4, 1.3, 1.5),
    Petal.Width = c(0.2, 0.2, 0.2, 0.2),
    Species = structure(c(1L, 1L, 1L, 1L),
        .Label = c("setosa", "versicolor", "virginica"),
        class = "factor")),
    .Names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"),
    row.names = c(NA, 4L), class = "data.frame")

If your data frame has a factor with many levels, the dput output can be unwieldy because it will still list all the possible factor levels even if they aren't present in the subset of your data. To solve this issue, you can use the droplevels() function. Notice below how Species is a factor with only one level:

> dput(droplevels(head(iris, 4)))
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6),
    Sepal.Width = c(3.5, 3, 3.2, 3.1),
    Petal.Length = c(1.4, 1.4, 1.3, 1.5),
    Petal.Width = c(0.2, 0.2, 0.2, 0.2),
    Species = structure(c(1L, 1L, 1L, 1L),
        .Label = "setosa",
        class = "factor")),
    .Names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"),
    row.names = c(NA, 4L), class = "data.frame")

One other caveat for dput is that it will not work for keyed data.table objects or for grouped tbl_df (class grouped_df) from dplyr. In these cases you can convert back to a regular data frame before sharing: dput(as.data.frame(my_data)).
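Before posting dput() output, it can be worth checking that it really rebuilds the object. One possible check, sketched with base R only (the names small, txt and rebuilt are illustrative), captures the output as text and evaluates it again:

```r
# Subset the data and drop unused factor levels
small <- droplevels(head(iris, 4))

# Capture exactly what would be pasted into the question
txt <- capture.output(dput(small))

# Rebuild the object from that text, as a reader would
rebuilt <- eval(parse(text = txt))

# Compare the original and the rebuilt copy
print(all.equal(small, rebuilt))
```

If the comparison reports differences, the pasted text will not give readers the same object you are working with.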

Worst case scenario, you can give a text representation that can be read in using the text parameter of read.table :

zz <- "Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa"

Data <- read.table(text=zz, header = TRUE)

Producing minimal code

This should be the easy part but often isn't. What you should not do is:

add all kinds of data conversions. Make sure the provided data is already in the correct format (unless that is the problem, of course)

copy-paste a whole function/chunk of code that gives an error. First, try to locate exactly which lines result in the error. More often than not you'll find out what the problem is yourself.

What you should do is:

state which packages should be used, if you use any (using library())

if you open connections or create files, add some code to close them or delete the files (using unlink())

if you change options, make sure the code contains a statement to revert them back to the original ones (e.g. op <- par(mfrow=c(1,2)) ...some code... par(op))

test run your code in a new, empty R session to make sure the code is runnable. People should be able to just copy-paste your data and your code into the console and get exactly the same result as you.
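The clean-up points in that list could be sketched as follows. This version uses options() rather than par() so that it runs without opening a graphics device, and the file contents are arbitrary; the names old_opts, tmp and con are illustrative:

```r
# Change an option and keep the old value so it can be restored
old_opts <- options(digits = 3)
print(pi)

# Create a temporary file, read from it through a connection
tmp <- tempfile(fileext = ".csv")
write.csv(head(mtcars, 3), tmp)

con <- file(tmp, open = "r")
first_line <- readLines(con, n = 1)
close(con)          # close the connection that was opened

unlink(tmp)         # delete the file that was created
options(old_opts)   # restore the original options
```

After the block runs, the reader's session is left as it was found: no stray file, no open connection, and the original digits setting.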

Give extra information

In most cases, just the R version and the operating system will suffice. When conflicts arise with packages, giving the output of sessionInfo() can really help. When talking about connections to other applications (be it through ODBC or anything else), one should also provide version numbers for those, and if possible also the necessary information on the setup.

If you are running R in RStudio, rstudioapi::versionInfo() can be helpful for reporting your RStudio version.

If you have a problem with a specific package you may want to provide the version of the package by giving the output of packageVersion("name of the package").
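A short sketch of collecting that information in one go, using the stats package as a stand-in for whichever package the question is about (the variable names are illustrative):

```r
r_ver   <- R.version.string          # R version in use
os_name <- Sys.info()[["sysname"]]   # operating system name
pkg_ver <- packageVersion("stats")   # replace "stats" with your package

cat(r_ver, "\n")
cat("OS:", os_name, "\n")
cat("stats version:", as.character(pkg_ver), "\n")

# Full details, useful when package conflicts are suspected
print(sessionInfo())
```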