同时合并列表中的多个数据帧

I have a list of many data.frames that I want to merge. The issue here is that each data.frame differs in terms of the number of rows and columns, but they all share the key variables (which I've called "var1" and "var2" in the code below). If the data.frames were identical in terms of columns, I could merely rbind, for which plyr's rbind.fill would do the job, but that's not the case with these data.

Because the merge command only works on 2 data.frames, I turned to the Internet for ideas. I got this one from here, which worked perfectly in R 2.7.2, which is what I had at the time:

merge.rec <- function(.list, ...){
    if(length(.list)==1) return(.list[[1]])
    Recall(c(list(merge(.list[[1]], .list[[2]], ...)), .list[-(1:2)]), ...)
}

And I would call the function like so:

df <- merge.rec(my.list, by.x = c("var1", "var2"), 
                by.y = c("var1", "var2"), all = T, suffixes=c("", ""))

But in any R version after 2.7.2, including 2.11 and 2.12, this code fails with the following error:

Error in match.names(clabs, names(xi)) : 
  names do not match previous names

(Incidently, I see other references to this error elsewhere with no resolution).

Is there any way to solve this?

转载于:https://stackoverflow.com/questions/8091303/simultaneously-merge-multiple-data-frames-in-a-list

6个回答

Another question asked specifically how to perform multiple left joins using dplyr in R . The question was marked as a duplicate of this one so I answer here, using the 3 sample data frames below:

library(dplyr)
x <- data_frame(i = c("a","b","c"), j = 1:3)
y <- data_frame(i = c("b","c","d"), k = 4:6)
z <- data_frame(i = c("c","d","a"), l = 7:9)

Update June 2018: I divided the answer in three sections representing three different ways to perform the merge. You probably want to use the purrr way if you are already using the tidyverse packages. For comparison purposes below, you'll find a base R version using the same sample dataset.

Join them with reduce from the purrr package

The purrr package provides a reduce function which has a concise syntax:

library(tidyverse)
list(x, y, z) %>% reduce(left_join, by = "i")
#  A tibble: 3 x 4
#  i       j     k     l
#  <chr> <int> <int> <int>
# 1 a      1    NA     9
# 2 b      2     4    NA
# 3 c      3     5     7

You can also perform other joins, such as a full_join or inner_join:

list(x, y, z) %>% reduce(full_join, by = "i")
# A tibble: 4 x 4
# i       j     k     l
# <chr> <int> <int> <int>
# 1 a     1     NA     9
# 2 b     2     4      NA
# 3 c     3     5      7
# 4 d     NA    6      8

list(x, y, z) %>% reduce(inner_join, by = "i")
# A tibble: 1 x 4
# i       j     k     l
# <chr> <int> <int> <int>
# 1 c     3     5     7

dplyr::left_join() with base R Reduce()

list(x,y,z) %>%
    Reduce(function(dtf1,dtf2) left_join(dtf1,dtf2,by="i"), .)

#   i j  k  l
# 1 a 1 NA  9
# 2 b 2  4 NA
# 3 c 3  5  7

Base R merge() with base R Reduce()

And for comparison purposes, here is a base R version of the left join

 Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "i", all.x = TRUE),
        list(x,y,z))
#   i j  k  l
# 1 a 1 NA  9
# 2 b 2  4 NA
# 3 c 3  5  7
csdnceshi60
℡Wang Yan is right, but you might be able to avoid (visibly) returning a list of data frames at all by using map_dfr() or map_dfc()
3 年多之前 回复
weixin_41568127
?yb? The full_join variant works perfectly, and looks a lot less scary than the accepted answer. Not much of a speed difference, though.
接近 4 年之前 回复

Reduce makes this fairly easy:

merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)

Here's a fully example using some mock data:

set.seed(1)
list.of.data.frames = list(data.frame(x=1:10, a=1:10), data.frame(x=5:14, b=11:20), data.frame(x=sample(20, 10), y=runif(10)))
merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)
tail(merged.data.frame)
#    x  a  b         y
#12 12 NA 18        NA
#13 13 NA 19        NA
#14 14 NA 20 0.4976992
#15 15 NA NA 0.7176185
#16 16 NA NA 0.3841037
#17 19 NA NA 0.3800352

And here's an example using these data to replicate my.list:

merged.data.frame = Reduce(function(...) merge(..., by=match.by, all=T), my.list)
merged.data.frame[, 1:12]

#  matchname party st district chamber senate1993 name.x v2.x v3.x v4.x senate1994 name.y
#1   ALGIERE   200 RI      026       S         NA   <NA>   NA   NA   NA         NA   <NA>
#2     ALVES   100 RI      019       S         NA   <NA>   NA   NA   NA         NA   <NA>
#3    BADEAU   100 RI      032       S         NA   <NA>   NA   NA   NA         NA   <NA>

Note: It looks like this is arguably a bug in merge. The problem is there is no check that adding the suffixes (to handle overlapping non-matching names) actually makes them unique. At a certain point it uses [.data.frame which does make.unique the names, causing the rbind to fail.

# first merge will end up with 'name.x' & 'name.y'
merge(my.list[[1]], my.list[[2]], by=match.by, all=T)
# [1] matchname    party        st           district     chamber      senate1993   name.x      
# [8] votes.year.x senate1994   name.y       votes.year.y
#<0 rows> (or 0-length row.names)
# as there is no clash, we retain 'name.x' & 'name.y' and get 'name' again
merge(merge(my.list[[1]], my.list[[2]], by=match.by, all=T), my.list[[3]], by=match.by, all=T)
# [1] matchname    party        st           district     chamber      senate1993   name.x      
# [8] votes.year.x senate1994   name.y       votes.year.y senate1995   name         votes.year  
#<0 rows> (or 0-length row.names)
# the next merge will fail as 'name' will get renamed to a pre-existing field.

Easiest way to fix is to not leave the field renaming for duplicates fields (of which there are many here) up to merge. Eg:

my.list2 = Map(function(x, i) setNames(x, ifelse(names(x) %in% match.by,
      names(x), sprintf('%s.%d', names(x), i))), my.list, seq_along(my.list))

The merge/Reduce will then work fine.

weixin_41568174
from.. Wow; this did it! Both "Recall" and "Reduce" solutions work fine now in 2.12. Thank you. I've never really run into a bug in core R code before... I do wonder why "Recall" worked in 2.7.2 but not now.
接近 9 年之前 回复
csdnceshi58
Didn"t forge The reason it worked without match.by is because it wasn't doing field renaming on name, but rather including that in key.
接近 9 年之前 回复
weixin_41568174
from.. You've gone above and beyond -- thanks. I uploaded replication code and real data that is accessed via url to show you the problem. Thanks for showing me pastebin.
接近 9 年之前 回复
csdnceshi58
Didn"t forge Still not able to replicate the problem - see updated answer. Can you provide better sample data? Maybe just save(my.list, file='my.list.RData') and upload?
接近 9 年之前 回复
weixin_41568174
from.. Sorry that last code line is wrong. When properly rewritten as test.reduce <- Reduce(function(...) merge(..., by=match.by, all=T), my.list) I get the same old "match.names" error.
接近 9 年之前 回复
weixin_41568174
from.. No it can't be empty data frames; your code isn't right. It should be L3 <- list(empty,empty, data.frame(x=1:10, a=1:10), data.frame(x=5:14, b=11:20), data.frame(x=sample(20, 10), y=runif(10)),empty,empty,empty) and then m3 = Reduce(function(...) merge(..., all=T), L3) works just fine.
接近 9 年之前 回复
weixin_41568174
from.. You're onto something. Your code runs fine above for me. And when I adapt it to mine, it runs fine too -- except that it does a merge ignoring the key variables I want. When I try to add key variables rather than leave them out, I get a new error "Error in is.null(x) : 'x' is missing". The code line is "test.reduce <- Reduce(function(...) merge(by=match.by, all=T), my.list)" where match.by are the vector of key variable names I want merged by.
接近 9 年之前 回复
csdnceshi52
妄徒之命 I suspect something happening with empty data frames. I tried out some examples like this: empty <- data.frame(x=numeric(0),a=numeric(0); L3 <- c(empty,empty,list.of.data.frames,empty,empty,empty) and got some weird stuff happening that I haven't figured out yet.
接近 9 年之前 回复
csdnceshi58
Didn"t forge Strange, I added the code that I tested it with which runs fine. I guess there is some field-renaming occurring based on the merge args you're using? The merged result must still have the relevant keys in order to be merged with the subsequent data frame.
接近 9 年之前 回复
weixin_41568174
from.. I tested this solution on R2.7.2 and I get the same match.names error. So there's some more fundamental problem with this solution and my data. I used the code: Reduce(function(x, y) merge(x, y, all=T,by.x=match.by, by.y=match.by), my.list, accumulate=F)
接近 9 年之前 回复
weixin_41568174
from.. Thanks! I saw this solution also on the link from Ramnath. Looks easy enough. But I get the following error: "Error in match.names(clabs, names(xi)) : names do not match previous names". The variables I'm matching on are all present in all the dataframes in the list, so I'm not catching what this error is telling me.
接近 9 年之前 回复

You can use recursion to do this. I haven't verified the following, but it should give you the right idea:

MergeListOfDf = function( data , ... )
{
    if ( length( data ) == 2 ) 
    {
        return( merge( data[[ 1 ]] , data[[ 2 ]] , ... ) )
    }    
    return( merge( MergeListOfDf( data[ -1 ] , ... ) , data[[ 1 ]] , ... ) )
}

I will reuse the data example from @PaulRougieux

x <- data_frame(i = c("a","b","c"), j = 1:3)
y <- data_frame(i = c("b","c","d"), k = 4:6)
z <- data_frame(i = c("c","d","a"), l = 7:9)

Here's a short and sweet solution using purrr and tidyr

library(tidyverse)

 list(x, y, z) %>% 
  map_df(gather, key=key, value=value, -i) %>% 
  spread(key, value)

You can do it using merge_all in the reshape package. You can pass parameters to merge using the ... argument

reshape::merge_all(list_of_dataframes, ...)

Here is an excellent resource on different methods to merge data frames.

csdnceshi78
程序go link is dead, is there a mirror?
大约 6 年之前 回复
weixin_41568110
七度&光 I updated the original post with my attempt to use a real example from my code, but shortening up the data frames for exposition.
接近 9 年之前 回复
weixin_41568110
七度&光 Yikes. My list has 19 data frames, each about 48-50 rows and 600 columns! dput puts tons of data on screen. What's the best way to summarize?
接近 9 年之前 回复
csdnceshi67
bug^君 I think I dropped this from reshape2. Reduce + merge is just as simple.
接近 9 年之前 回复
csdnceshi63
elliott.david it would be useful to post a few lines of your original data frames, so that your error is reproducible. you can easily do it using dput.
接近 9 年之前 回复
weixin_41568110
七度&光 I'm a little confused; should I do merge_all or merge_recurse? In any case, when I try to add in my additional arguments to either, I get the error "formal argument "all" matched by multiple actual arguments".
接近 9 年之前 回复
csdnceshi63
elliott.david yes. whenever i have an idea, i always check if @hadley has already done it, and most of the times he has :-)
接近 9 年之前 回复
csdnceshi79
python小菜 looks like I just replicated merge_recurse =) good to know this function already exists.
接近 9 年之前 回复

I had a list of dataframes with no common id column.
I had missing data on many dfs. There were Null values. The dataframes were produced using table function. The Reduce, Merging, rbind, rbind.fill, and their like could not help me to my aim. My aim was to produce an understandable merged dataframe, irrelevant of the missing data and common id column.

Therefore, I made the following function. Maybe this function can help someone.

##########################################################
####             Dependencies                        #####
##########################################################

# Depends on Base R only

##########################################################
####             Example DF                          #####
##########################################################

# Example df
ex_df           <- cbind(c( seq(1, 10, 1), rep("NA", 0), seq(1,10, 1) ), 
                         c( seq(1, 7, 1),  rep("NA", 3), seq(1, 12, 1) ), 
                         c( seq(1, 3, 1),  rep("NA", 7), seq(1, 5, 1), rep("NA", 5) ))

# Making colnames and rownames
colnames(ex_df) <- 1:dim(ex_df)[2]
rownames(ex_df) <- 1:dim(ex_df)[1]

# Making an unequal list of dfs, 
# without a common id column
list_of_df      <- apply(ex_df=="NA", 2, ( table) )

it is following the function

##########################################################
####             The function                        #####
##########################################################


# The function to rbind it
rbind_null_df_lists <- function ( list_of_dfs ) {
  length_df     <- do.call(rbind, (lapply( list_of_dfs, function(x) length(x))))
  max_no        <- max(length_df[,1])
  max_df        <- length_df[max(length_df),]
  name_df       <- names(length_df[length_df== max_no,][1])
  names_list    <- names(list_of_dfs[ name_df][[1]])

  df_dfs <- list()
  for (i in 1:max_no ) {

    df_dfs[[i]]            <- do.call(rbind, lapply(1:length(list_of_dfs), function(x) list_of_dfs[[x]][i]))

  }

  df_cbind               <- do.call( cbind, df_dfs )
  rownames( df_cbind )   <- rownames (length_df)
  colnames( df_cbind )   <- names_list

  df_cbind

}

Running the example

##########################################################
####             Running the example                 #####
##########################################################

rbind_null_df_lists ( list_of_df )
立即提问