How to improve speed when reading a binary file in R
I have to read in quite big binary files in R (to process, transform, and convert them to other formats). The approach works in general but, unfortunately, the script runs forever (>12 h for the first part of the data, which is less than 2% of the whole).
I assume the problem is not genuinely due to the size of the data (at least, that is not the full explanation) but due to inefficient code. I am looking for a way to speed up the runtime and would be grateful for any help!
My approach is based on this tutorial: https://stats.idre.ucla.edu/r/faq/how-can-i-read-binary-data-into-r/
In the code below I include only 2 variables instead of thousands. In total, the data are ~100 GB; as said above, processing the first part (<2%) already takes >12 h.
The data are split into smaller files that I process separately (for every part, 1 script and 1 dataset).
My code:
library(data.table)

newdata <- file(paste0(getwd(), "/file.dat"), "rb")

# Here, only the first 2 variables
dataset <- data.table(id = integer(), v1 = integer())

# 327639 is the number of cases (data on people)
for (i in 1:327639) {
  bla <- readBin(con = newdata, integer(), size = 2, n = 2000, endian = "big")
  id <- i
  v1 <- bla[1]
  dataset <- rbind(dataset, list(id, v1))
}

save(dataset, file = paste0(getwd(), "/output/", "part_a.RData"))
close(newdata)
Thanks for your help!
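The likely bottleneck is the rbind() inside the loop: every iteration copies the entire data.table built so far, so the total cost grows roughly quadratically with the number of cases, and readBin() is called 327,639 times with a small buffer on top of that. Below is a minimal, self-contained timing sketch of just that effect, with a made-up row count and made-up values (nothing here touches the actual data file):

library(data.table)

n <- 5000L  # hypothetical row count, kept small so the slow variant finishes

# Slow pattern: grow the table with rbind(), copying all existing rows each time
grow <- system.time({
  dt <- data.table(id = integer(), v1 = integer())
  for (i in seq_len(n)) dt <- rbind(dt, list(i, i))
})

# Fast pattern: preallocate once, then fill rows in place with set()
fill <- system.time({
  dt <- data.table(id = integer(n), v1 = integer(n))
  for (i in seq_len(n)) set(dt, i, c("id", "v1"), list(i, i))
})

rbind(grow = grow, fill = fill)

Preallocating the table and reading the file in larger chunks avoids both problems.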
Maybe I am wrong; if so, please excuse the noise.
The idea in the comment is implemented as follows. (Untested, as I have no data file.)
# numbytes <- file.size(newdata)
numbytes <- 327639L
dataset <- data.table(id = integer(numbytes), v1 = integer(numbytes))

chunk <- 2^15
passes <- numbytes %/% chunk
remainder <- numbytes %% chunk

i <- 1L
for (j in seq_len(passes)) {
  bla <- readBin(con = newdata, integer(), n = chunk, size = 2, endian = "big")
  dataset$id[i:(i + chunk - 1L)] <- i:(i + chunk - 1L)
  dataset$v1[i:(i + chunk - 1L)] <- bla
  i <- i + chunk
}

bla <- readBin(con = newdata, integer(), n = remainder, size = 2, endian = "big")
dataset$id[i:numbytes] <- i:numbytes
dataset$v1[i:numbytes] <- bla
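If a whole part fits comfortably in memory, a further simplification is to drop the loop entirely: read everything with a single readBin() call and reshape. This is only a sketch and rests on assumptions not stated in the question — that each case really consists of 2000 two-byte integers (taken from n = 2000 in the original loop) and that /file.dat holds exactly 327639 such records:

library(data.table)

n_cases <- 327639L
n_vars  <- 2000L   # assumed record width, taken from n = 2000 in the question

con <- file(paste0(getwd(), "/file.dat"), "rb")
values <- readBin(con, integer(), n = n_cases * n_vars, size = 2, endian = "big")
close(con)

# One row per case; keep only the columns that are actually needed.
m <- matrix(values, nrow = n_cases, ncol = n_vars, byrow = TRUE)
dataset <- data.table(id = seq_len(n_cases), v1 = m[, 1])

This trades memory for speed (the integer vector alone is roughly 2.6 GB here); when a part does not fit in RAM, the chunked loop above is the better compromise.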