How to make a great R reproducible example

When discussing performance with colleagues, teaching, sending a bug report, or searching for guidance on mailing lists and here on Stack Overflow, a reproducible example is often asked for and always helpful.

What are your tips for creating an excellent example? How do you paste data structures from R in a text format? What other information should you include?

Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc.?

How does one make a great R reproducible example?

Reposted from: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

程序go 2011-05-12 13:21
A minimal reproducible example consists of the following items:

a minimal dataset, necessary to reproduce the error

the minimal runnable code necessary to reproduce the error, which can be run on the given dataset.

the necessary information on the used packages, R version, and system it is run on.

in the case of random processes, a seed (set by set.seed()) for reproducibility

Looking at the examples in the help files of the functions you use is often helpful. In general, all the code given there fulfills the requirements of a minimal reproducible example: data is provided, minimal code is provided, and everything is runnable.
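As a rough sketch of how the four items above fit together in one script (the data frame `dat`, its columns, and the `aggregate()` call are made up for illustration, not taken from any particular question):

```r
# A self-contained example: seed, data, code, and session details in one script.
set.seed(42)                       # fix the seed so the random values repeat

# Minimal dataset: just enough rows and columns to show the problem
dat <- data.frame(
  group = rep(c("a", "b"), each = 5),
  value = rnorm(10)
)

# Minimal code: only the call under discussion
group_means <- aggregate(value ~ group, data = dat, FUN = mean)
print(group_means)

# Environment details a reader may need
print(R.version.string)
```

Anyone can paste this into a fresh session and get the same numbers, which is the whole point.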

Producing a minimal dataset

For most cases, this can be easily done by just providing a vector/data frame with some values. Or you can use one of the built-in datasets, which are provided with most packages.
A comprehensive list of built-in datasets can be seen with library(help = "datasets"). There is a short description of every dataset, and more information can be obtained with, for example, ?mtcars, where 'mtcars' is one of the datasets in the list. Other packages might contain additional datasets.

Making a vector is easy. Sometimes it is necessary to add some randomness to it, and there is a whole range of functions to do that. sample() can randomize a vector, or give a random vector with only a few values. letters is a useful vector containing the alphabet. This can be used for making factors.
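A small sketch of browsing the built-in datasets from code rather than from the help viewer, assuming only the base 'datasets' package (the variable name `ds` is arbitrary):

```r
# data(package = ...) returns an object whose $results matrix lists each dataset
ds <- data(package = "datasets")
head(ds$results[, c("Item", "Title")])

# Peek at one built-in dataset before using it in an example
head(mtcars, 3)
str(mtcars)
```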

A few examples :

random values : x <- rnorm(10) for normal distribution, x <- runif(10) for uniform distribution, ...

a permutation of some values : x <- sample(1:10) for vector 1:10 in random order.

a random factor : x <- sample(letters[1:4], 20, replace = TRUE)
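Put together with a seed, the snippets above might look like the following. The variable names are arbitrary, and the factor() wrapper is an addition here, since sample() on letters returns a character vector rather than a factor:

```r
set.seed(123)                      # any fixed seed makes the draws repeatable

x_norm <- rnorm(10)                # 10 draws from a standard normal
x_unif <- runif(10)                # 10 draws from a uniform on [0, 1]
x_perm <- sample(1:10)             # the values 1..10 in random order

# sample() returns characters here; wrap in factor() to get an actual factor
x_fac <- factor(sample(letters[1:4], 20, replace = TRUE))

str(list(x_norm, x_unif, x_perm, x_fac))
```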

For matrices, one can use matrix(), eg :

matrix(1:10, ncol = 2)

Data frames can be made using data.frame(). Take care to name the columns of the data frame, and do not make it overly complicated.

An example :

set.seed(1)
Data <- data.frame(
    X = sample(1:10),
    Y = sample(c("yes", "no"), 10, replace = TRUE)
)

For some questions, specific formats can be needed. For these, one can use any of the provided as.someType functions: as.factor, as.Date, as.xts, ... Use these in combination with the vector and/or data frame tricks.
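One detail worth pinning down when sharing such a data frame: since R 4.0.0, data.frame() no longer converts strings to factors by default, so readers on different R versions may see different column types. A variation of the example that states the choice explicitly (setting stringsAsFactors = TRUE here is an assumption about what the question needs):

```r
set.seed(1)
Data <- data.frame(
  X = sample(1:10),
  Y = sample(c("yes", "no"), 10, replace = TRUE),
  stringsAsFactors = TRUE          # explicit, so Y is a factor on any R version
)

str(Data)                          # shows the column types readers will get
```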
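A brief illustration of the base conversions mentioned above. The values and the names `when`, `level` and `events` are invented; as.xts is left out because it needs the xts package:

```r
# Build columns in the exact types the question depends on
when  <- as.Date(c("2011-05-11", "2011-05-12", "2011-05-13"))
level <- as.factor(c("low", "high", "low"))

events <- data.frame(when = when, level = level)
str(events)                        # Date and factor columns, ready to share
```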

Copy your data

If you have some data that would be too difficult to construct using these tips, then you can always make a subset of your original data, using e.g. head(), subset() or the indices. Then use e.g. dput() to give us something that can be put in R immediately:

> dput(head(iris,4))
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5,
3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2,
0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), .Label = c("setosa",
"versicolor", "virginica"), class = "factor")), .Names = c("Sepal.Length",
"Sepal.Width", "Petal.Length", "Petal.Width", "Species"), row.names = c(NA,
4L), class = "data.frame")

If your data frame has a factor with many levels, the dput output can be unwieldy because it will still list all the possible factor levels even if they aren't present in the subset of your data. To solve this issue, you can use the droplevels() function. Notice below how Species is a factor with only one level:
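Before posting dput() output, it can be worth checking that it really rebuilds your object. A minimal round-trip sketch (the names `small`, `txt` and `rebuilt` are arbitrary; the exact text dput() prints can differ between R versions, so compare objects rather than strings):

```r
small <- head(iris, 4)

# Capture what dput() would print, exactly as a reader would copy it
txt <- paste(capture.output(dput(small)), collapse = "\n")

# Evaluate the text as a reader would after pasting it into the console
rebuilt <- eval(parse(text = txt))

# all.equal() reports TRUE when the rebuilt object matches the original
print(all.equal(small, rebuilt))
```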

> dput(droplevels(head(iris, 4)))
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5,
3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2,
0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), .Label = "setosa",
class = "factor")), .Names = c("Sepal.Length", "Sepal.Width",
"Petal.Length", "Petal.Width", "Species"), row.names = c(NA,
4L), class = "data.frame")

One other caveat for dput is that it will not work for keyed data.table objects or for grouped tbl_df (class grouped_df) from dplyr. In these cases you can convert back to a regular data frame before sharing: dput(as.data.frame(my_data)).
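The effect of droplevels() is easy to confirm by comparing the factor levels before and after (a short check using the same iris subset; `small` and `trimmed` are arbitrary names):

```r
small <- head(iris, 4)
levels(small$Species)              # all three species are still listed

trimmed <- droplevels(small)
levels(trimmed$Species)            # only the level actually present remains
```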

Worst case scenario, you can give a text representation that can be read in using the text parameter of read.table :

zz <- "Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa"

Data <- read.table(text=zz, header = TRUE)

Producing minimal code

This should be the easy part but often isn't. What you should not do, is:

add all kinds of data conversions. Make sure the provided data is already in the correct format (unless that is the problem, of course)

copy-paste a whole function/chunk of code that gives an error. First, try to locate which lines exactly result in the error. More often than not you'll find out what the problem is yourself.

What you should do, is:

add which packages should be used if you use any (using library())

if you open connections or create files, add some code to close them or delete the files (using unlink())

if you change options, make sure the code contains a statement to revert them back to the original ones. (eg op <- par(mfrow=c(1,2)) ...some code... par(op) )

test run your code in a new, empty R session to make sure the code is runnable. People should be able to just copy-paste your data and your code in the console and get exactly the same as you have.
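The clean-up advice in the list above could look roughly like this. The temporary file, the connection `con`, and the use of options(digits = 3) in place of par() are all stand-ins chosen so the sketch runs without a graphics device:

```r
# Create a file, read it through a connection, then tidy up both
tmp <- tempfile(fileext = ".txt")
writeLines(c("a b", "1 2"), tmp)

con <- file(tmp, open = "r")
lines_read <- readLines(con)
close(con)                         # close the connection you opened
unlink(tmp)                        # delete the file you created

# Change an option, then restore the previous value
old <- options(digits = 3)         # options() returns the old settings
print(pi)
options(old)                       # revert to whatever was set before
```

The par() pattern from the list works the same way: par() returns the previous settings, and passing them back restores the original state.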

Give extra information

In most cases, just the R version and the operating system will suffice. When conflicts arise with packages, giving the output of sessionInfo() can really help. When talking about connections to other applications (be it through ODBC or anything else), one should also provide version numbers for those, and if possible also the necessary information on the setup.
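A short sketch of collecting this information from within R, using base functions only (the variable names are arbitrary):

```r
# The two items that suffice in most cases
r_version <- R.version.string      # R version as a single string
os_name   <- Sys.info()[["sysname"]]  # operating system name
print(r_version)
print(os_name)

# Fuller picture when package conflicts are suspected
si <- sessionInfo()
print(si)                          # R version, platform, attached packages
```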

If you are running R in RStudio, using rstudioapi::versionInfo() can be helpful to report your RStudio version.

If you have a problem with a specific package you may want to provide the version of the package by giving the output of packageVersion("name of the package").
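For instance, with "stats" standing in for whichever package is involved (it ships with every R installation, so the call always works):

```r
v <- packageVersion("stats")       # replace "stats" with your package's name
print(v)

# The result can be converted to text for pasting, or compared to a version
as.character(v)
v >= "3.0.0"
```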