Parallel processing in R: parallelizing the ldply and replicate functions


I have tried for quite some time to parallelize this code, to no avail. Either I get errors or nothing works. Does anyone have any ideas?

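## Assumption: LDA(), perplexity() and the logLik() method come from the topicmodels package
require(topicmodels)
require(plyr)
## Fit an LDA model with n topics on a random training split and
## return n, the hold-out perplexity and the log-likelihood
cal_ops <- function(n, dtm, ratio = 0.1) {
  print(n)
  selvect <- sample(nrow(dtm), nrow(dtm) * ratio)
  holdout <- dtm[selvect, ]
  training <- dtm[-selvect, ]
  topmodel <- LDA(training, n, control = list(estimate.alpha = FALSE))
  return(c(n, perplexity(topmodel, holdout), as.numeric(logLik(topmodel))))
}
replication <- 1000
sequ <- seq(5, 100, 5)
## One row per replicate, for every number of topics in sequ
perplex <- ldply(sequ, function(x, dtm) {
  t(replicate(replication, cal_ops(x, dtm)))
}, dtm = dtm_to_use)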

It takes a long time to run as is. Thank you in advance.

I've tried using this example of a parallel version of replicate, but I got many errors: https://stackoverflow.com/a/19281611/8598566

Your example is not reproducible, e.g. dtm_to_use is not defined, which makes it hard to give anything other than a "the-following-should-work" suggestion:

The plyr::ldply(x) function takes the argument .parallel = TRUE, which processes x in chunks distributed over whatever number of workers you have. It uses the foreach framework internally for parallel processing, so you can use any of the "do" packages. Here is an example using future backends:

library("doFuture")
registerDoFuture()
## Utilize all cores available to the R session
## (plan(multiprocess) is deprecated in recent versions of future)
plan(multisession)
replication <- 1000
sequ <- seq(from = 5, to = 100, by = 5)
perplex <- plyr::ldply(sequ, function(x) {
  t(replicate(replication, c(a = x, b = sqrt(x))))
}, .parallel = TRUE)
str(perplex)
## 'data.frame':   20000 obs. of  2 variables:
##  $ a: num  5 5 5 5 5 5 5 5 5 5 ...
##  $ b: num  2.24 2.24 2.24 2.24 2.24 ...
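The example above drops the extra dtm argument used in the question. As a rough sketch of how such an argument could be passed through ldply() when .parallel = TRUE, the following uses a hypothetical stand-in, toy_ops(), and a made-up numeric matrix, dtm_toy, in place of cal_ops() and dtm_to_use (neither stand-in is from the question, and no topic model is fitted). Because toy_ops() calls sample(), future may warn that random numbers were generated without a parallel-safe RNG setup.

```r
library("plyr")
library("doFuture")
registerDoFuture()
plan(multisession, workers = 2)  # two background R sessions; adjust as needed

## Hypothetical stand-in for cal_ops(): same hold-out split, but it only
## returns n and the mean of each split instead of fitting a model.
toy_ops <- function(n, dtm, ratio = 0.1) {
  selvect <- sample(nrow(dtm), nrow(dtm) * ratio)
  holdout <- dtm[selvect, , drop = FALSE]
  training <- dtm[-selvect, , drop = FALSE]
  c(n = n, holdout_mean = mean(holdout), training_mean = mean(training))
}

dtm_toy <- matrix(runif(200), nrow = 50)  # made-up data: 50 rows, 4 columns
replication <- 10
sequ <- seq(from = 5, to = 20, by = 5)

## Extra arguments (here dtm) are forwarded to the function, as in the
## sequential version; .parallel = TRUE hands the work to the workers.
perplex <- ldply(sequ, function(x, dtm) {
  t(replicate(replication, toy_ops(x, dtm)))
}, dtm = dtm_toy, .parallel = TRUE)

plan(sequential)  # release the background workers
```

If this pattern works for the toy function, replacing toy_ops() and dtm_toy with the real function and data should only require that the needed packages are also installed on the workers.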

Since you mentioned HPC as a target: if you have an ad-hoc cluster without a job scheduler, where you can SSH to each node, you can use:

plan(cluster, workers = c("node1", "node2", "node2", "node3")) 

to run one core each on node1 and node3, and two cores on node2. If you have a real job scheduler, such as SGE, you can use:

library("future.batchtools")
plan(batchtools_sge)

and each element in sequ will be processed as an individual job on the queue (which corresponds to having an infinite number of workers). If you want to chunk them up, you can limit the number of workers (= jobs), e.g.

plan(batchtools_sge, workers = 200) 

Your script will be identical regardless of which backend is used.
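To illustrate that point, here is a minimal sketch that runs the same ldply() call under two different plans and compares the results. The helper name run_analysis() is made up for this illustration, and the worker count of 2 is an arbitrary choice.

```r
library("plyr")
library("doFuture")
registerDoFuture()

sequ <- seq(from = 5, to = 100, by = 5)

## Hypothetical helper: the analysis code itself never mentions a backend.
run_analysis <- function() {
  ldply(sequ, function(x) c(a = x, b = sqrt(x)), .parallel = TRUE)
}

plan(sequential)                 # everything runs in the current R session
res_seq <- run_analysis()

plan(multisession, workers = 2)  # same code, now on background R sessions
res_par <- run_analysis()

plan(sequential)                 # release the background workers

all.equal(res_seq, res_par)
```

Only the plan() line changes between the two runs; swapping in plan(cluster, ...) or plan(batchtools_sge) would follow the same pattern.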

