2013-03-24

A Merging Test Bench

As requested, here's the packed data and a test bench on which you can try your own merging-function ideas and (hopefully) replicate my results. If you want the plots, you can use the end parts of the scripts in part 1 and part 2.

The data is a bunch of super secret Eve Online killmails. The first part of the script handles the JSON-to-data.frame transformation. After that I introduce the R core merging in a for loop, the data.table variant, the Reduce function variants, and finally fast.merging. There's actually a lot more data than we are using in the bench: 166106 killmails in total, of which we merge only 3000. That is because, as noted in previous posts, the merge for loop seems to run in exponential time, so going much higher than 3000 lines seems really excessive. Even now it took over an hour to run the whole bench on an Intel i5 3570K (not overclocked). So if you want faster results, choose around 1000 frames and set the rbenchmark replications to 1. With a low frame count (under 100) the results are fairly similar.

If you are going to test whether the results from the different functions are the same, note that the order of rows and columns will differ. I'm eagerly waiting for your results and ideas.

#Using 64bit R 2.15.3
#Packages we are going to use
library(rjson)
library(data.table)
library(rbenchmark)
#Set your working directory to where the unpacked file is, or add the path
#Reading only part of the data for this test; you can naturally do as you wish
eve = readLines("kms-2013-01-06-s.json", n=3001)
#json => data.frames (data lines)
eve = matrix(eve, ncol=1)
eve.list = apply(eve, 1, list)
#First and last lines are useless
eve.list = eve.list[-1]
#eve.list = eve.list[-length(eve.list)] #didn't read the last line in this case
#Handling json...
eve.list = lapply(eve.list, function(x){
  x = unlist(x)
  return(fromJSON(x))
})
#Flattening each parsed line into a one-row matrix, collapsing duplicated names
eve.data.frame = function(eve.list){
  framo = unlist(eve.list)
  dups = unique(names(framo)[duplicated(names(framo))])
  if(length(dups) > 0){
    for(i in seq(dups)){
      bol = names(framo) %in% dups[i]
      framo[dups[i]] = paste(framo[bol], collapse=",")
    }
    framo = framo[!duplicated(names(framo))]
  }
  return(as.matrix(t(framo)))
}
#This takes a while if you read the whole data
eve.frames = lapply(eve.list, eve.data.frame)
#For example
eve.frames[[1]]
eve.frames[[100]]
#The amount of variables in each data line
N.variables = sapply(eve.frames, ncol)
table(N.variables)
#Converting data.frames to data.tables
eve.data.table = lapply(eve.frames, as.data.table)
#Is this correct?
#The functions used
core.forLoop <- function(data.list) {
  complete = data.list[[1]]
  for(i in 2:length(data.list)){
    part = data.list[[i]]
    complete = merge(complete, part,
                     all=TRUE,
                     sort=FALSE)
  }
  return(complete)
}
data.table.forLoop <- function(data.list){
  complete = data.list[[1]]
  for(i in 2:length(data.list)){
    part = data.list[[i]]
    complete = merge(complete, part,
                     by=intersect(names(complete), names(part)),
                     all=TRUE,
                     sort=FALSE)
  }
  return(complete)
}
core.Reduce.merge = function(data.list){
  Reduce(function(a, b) merge(a, b, all=TRUE, sort=FALSE), data.list)
}
data.table.Reduce.merge = function(data.list){
  Reduce(function(a, b) merge(a, b, by=intersect(names(a), names(b)), all=TRUE, sort=FALSE), data.list)
}
fast.merging = function(data.list, nparts){
  if(!is.list(data.list)) stop("data.list isn't a list")
  while(length(data.list) != 1){ #Loop until everything is merged
    if(length(data.list) > nparts){
      starts = seq(1, length(data.list), nparts)
      ends = seq(nparts, length(data.list), nparts) #starts and ends are of equal length if nparts divides length(data.list) evenly
      if(length(ends) < length(starts)) ends = c(ends, length(data.list)) #making sure things are even
      sections = matrix(c(starts, ends), ncol=2, byrow=FALSE)
      sections = apply(sections, 1, list)
    }else{
      sections = list(c(1, length(data.list)))
    }
    #We have the standard way inside lapply
    data.list = lapply(sections, function(x, data.list){
      if(is.list(x)) x = x[[1]]
      #the standard way starts ->
      part = data.list[[x[1]]]
      if(x[1] < x[2]){ #skip the redundant merge of the first frame with itself
        for(i in (x[1]+1):x[2]){
          part = merge(part, data.list[[i]], all=TRUE, sort=FALSE)
        }
      }
      #<- standard way ends
      return(part)
    }, data.list = data.list)
  }
  return(data.list[[1]]) #returning the merged data frame
}
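#To see how fast.merging proceeds: with 100 frames and nparts=10, the first
#pass of the while loop merges bins of 10 frames into 10 merged frames, and
#the second pass merges those 10 into the final one.
fast.merging(eve.frames[1:100], 10)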
#This will take a while, go to lunch or something like that
benchmark(core.forLoop(eve.frames),
          data.table.forLoop(eve.data.table),
          core.Reduce.merge(eve.frames),
          data.table.Reduce.merge(eve.data.table),
          fast.merging(eve.frames, 10),
          replications=5)
# test replications elapsed relative user.self sys.self user.child sys.child
#1 core.forLoop(eve.frames) 5 1398.29 4.932 1393.98 0.06 NA NA
#3 core.Reduce.merge(eve.frames) 5 1405.77 4.958 1401.02 0.31 NA NA
#2 data.table.forLoop(eve.data.table) 5 797.76 2.814 794.61 0.09 NA NA
#4 data.table.Reduce.merge(eve.data.table) 5 801.93 2.829 799.91 0.16 NA NA
#5 fast.merging(eve.frames, 10) 5 283.51 1.000 282.95 0.00 NA NA
#If you are going to check whether the results are the same, remember to sort your rows
#and columns first, basically data[order(data[, 1]), order(colnames(data))] for all results.
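For comparing results, a small helper along those lines might look like this (a sketch; it assumes the first column is a usable sort key, and res.a/res.b stand for any two merge results):

#Sort rows and columns into a canonical order before comparing
sort.result = function(data){
  data = as.data.frame(data, stringsAsFactors=FALSE)
  data[order(data[, 1]), order(colnames(data))]
}
#e.g. all.equal(sort.result(res.a), sort.result(res.b), check.attributes=FALSE)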

4 comments:

  1. Anonymous, 25/3/13 18:58

    Merge seems like the wrong tool to use in this instance, which is why the operation takes so long. All you are doing is an rbind with NAs filled in for missing columns. There are likely faster methods out there, but I don't think merge is the right tool here.

    Try the following, making sure the elements of eve.frames are actually data.frames:

    library(plyr)

    eve.frames2 <- lapply(eve.frames, as.data.frame, stringsAsFactors=FALSE)

    TMP2 <- do.call(rbind.fill, eve.frames2)

  2. Yes, your plyr was faster. I'm just wondering whether it will translate to a parallel solution, because I think I can just parallel-FM my way out (rough sketch at the end of this comment). :)

    Also, I'm wondering: if you have to use merge( , all=TRUE), what would be a generally good method? It's one of the lines I repeat most often at work, so even if it's not actually needed here, I want to be sure. Of course this example is kind of lame, because you can just easily plyr your way out of it.

    After a quick test (not parallel), at 10 000 frames:

    Elapsed (sec):
    plyr 124.03
    fm   177.84
    fm2   10.94 (my newest solution)

    My newest solution is basically: sort all by columns => rbind unique frames => merge. I'll spam the code if it actually works. Currently it runs pretty fast, I'm just not 100% sure whether it does it correctly.
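
    For the parallel idea, roughly what I mean is swapping the lapply inside fast.merging for parallel::mclapply (a sketch, untested; mclapply forks, so it won't parallelize on Windows):

    library(parallel)
    #Same divide-and-conquer as fast.merging, but each pass merges its bins in parallel
    fast.merging.par = function(data.list, nparts, cores=2){
      while(length(data.list) != 1){
        if(length(data.list) > nparts){
          starts = seq(1, length(data.list), nparts)
          ends = seq(nparts, length(data.list), nparts)
          if(length(ends) < length(starts)) ends = c(ends, length(data.list))
          sections = apply(matrix(c(starts, ends), ncol=2), 1, list)
        }else{
          sections = list(c(1, length(data.list)))
        }
        data.list = mclapply(sections, function(x, data.list){
          if(is.list(x)) x = x[[1]]
          part = data.list[[x[1]]]
          if(x[1] < x[2]) for(i in (x[1]+1):x[2]){
            part = merge(part, data.list[[i]], all=TRUE, sort=FALSE)
          }
          part
        }, data.list = data.list, mc.cores = cores)
      }
      data.list[[1]]
    }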

  3. Anonymous, 25/3/13 21:41

    First refine your algorithm and then think about going parallel. Should a problem of this size take minutes, or even tens of seconds, to complete? Likely not. This matrix version avoids data.frames and will get you there faster than your previous versions. Tweak it to get the output how you want it (and I'm sure it can be improved). Likely another part of your code is now the bottleneck.

    As an aside, I think you are abusing both the base merge and data.table::merge, and you should be careful not to jump to the wrong conclusions about speed.

    After you have created eve.frames, run this code.

    # GET THE UNIQUE NAMES IN ALL THE FILES
    nm <- unique(unlist(sapply(eve.frames, colnames)))

    # PREALLOCATE (one row per parsed killmail)
    eve.matrix <- matrix(NA_character_, nrow=length(eve.frames), ncol=length(nm))
    colnames(eve.matrix) <- nm

    system.time({
      # LOOP THROUGH EACH ENTRY AND FILL MATRIX
      for(i in 1:length(eve.frames)){
        eveTmp <- eve.frames[[i]]
        ind <- nm[(nm %in% colnames(eveTmp))]
        eve.matrix[i, ind] <- eveTmp[, ind]
      }
    })

    Jonathan

    Replies
    1. Hi Jonathan,

      True, that's basically the same as the plyr package's rbind.fill, but much faster. It's interesting that the order of the columns doesn't matter, and that it's faster.

      I've already done the parallel solution for the old one, so it doesn't really matter.

      This is hopefully just some preliminary testing. I'm going to fragment the data so that rbind solutions will not work. Then we can truly measure the merging speed and check whether the result is correct.

      And I'm assuming that the speed here will correlate with the speed later on when we do this on the fragmented data. If my assumption is wrong, what does that tell us about the current merging functions?

      - Xachriel
