R - Time difference between different subsetting methods for data.frame and matrix objects
Consider the following benchmark (R 3.4.1 on a Windows machine):
    library(rbenchmark)

    mtx <- matrix(runif(1e8), ncol = 100)
    df <- as.data.frame(mtx)
    colnames(mtx) <- colnames(df) <- paste0("v", 1:100)

    benchmark(
      mtx[5000:7000, 80],
      mtx[5000:7000, "v80"],
      mtx[, "v80"][5000:7000],
      mtx[, "v80", drop = FALSE][5000:7000, ],
      mtx[5000:7000, , drop = FALSE][, "v80"],
      # mtx$v80[5000:7000],  # does not apply to a matrix
      replications = 5000
    )
    ##                                      test replications elapsed relative user.self sys.self user.child sys.child
    ## 4 mtx[, "v80", drop = FALSE][5000:7000, ]         5000   64.71  588.273     47.44    16.61         NA        NA
    ## 3                 mtx[, "v80"][5000:7000]         5000   72.15  655.909     52.90    18.18         NA        NA
    ## 2                   mtx[5000:7000, "v80"]         5000    0.11    1.000      0.11     0.00         NA        NA
    ## 5 mtx[5000:7000, , drop = FALSE][, "v80"]         5000    7.47   67.909      5.89     1.47         NA        NA
    ## 1                      mtx[5000:7000, 80]         5000    0.13    1.182      0.12     0.00         NA        NA

    benchmark(
      df[5000:7000, 80],
      df[5000:7000, "v80"],
      df[, "v80"][5000:7000],
      df[, "v80", drop = FALSE][5000:7000, ],
      df[5000:7000, , drop = FALSE][, "v80"],
      df$v80[5000:7000],
      replications = 5000
    )
    ##                                     test replications elapsed relative user.self sys.self user.child sys.child
    ## 6                      df$v80[5000:7000]         5000    0.13    1.000      0.12     0.00         NA        NA
    ## 4 df[, "v80", drop = FALSE][5000:7000, ]         5000    0.33    2.538      0.33     0.00         NA        NA
    ## 3                 df[, "v80"][5000:7000]         5000    0.17    1.308      0.17     0.00         NA        NA
    ## 2                   df[5000:7000, "v80"]         5000    0.15    1.154      0.16     0.00         NA        NA
    ## 5 df[5000:7000, , drop = FALSE][, "v80"]         5000   13.63  104.846     12.91     0.39         NA        NA
    ## 1                      df[5000:7000, 80]         5000    0.19    1.462      0.17     0.00         NA        NA

The time difference is pretty dramatic. Why is that? What is the recommended way of subsetting, and why? Given the benchmarks, mtx[i, colname] for a matrix and df$colname[i] for a data.frame (although there it doesn't seem to make much difference) appear to be the most time-efficient, but are there any general reasons why one should prefer one of these approaches?
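The main reason lies in the R data structures behind matrices and data.frames. A matrix is an object with rownumber x columnnumber (mainly numeric) entries (by R's default a matrix is not sparse) and a dimension property; internally it is a single atomic vector carrying a dim attribute. For this reason, the first two commands

    mtx[5000:7000, 80]
    mtx[5000:7000, "v80"]

can pick the 2001 requested values straight out of that vector. With the default drop = TRUE, R does not assign these values a dimension, so the result is a simple vector, R's default object, rather than a new matrix object. The variants that start with mtx[, "v80"] are slower because they first extract the entire column of 1e6 values and only then select the rows.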
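The effect of drop on a matrix can be checked on a small example. This is only a sketch: the 6 x 3 matrix `m` and the names `x` and `y` are hypothetical and are not part of the benchmark above.

```r
# Hypothetical small matrix, used only for illustration
m <- matrix(1:18, ncol = 3, dimnames = list(NULL, paste0("v", 1:3)))

# A matrix is an atomic vector with a dim attribute
dim(m)                          # 6 3

# Default drop = TRUE: a single-column selection loses its dim attribute
x <- m[2:4, "v2"]
is.matrix(x)                    # FALSE
is.null(dim(x))                 # TRUE

# drop = FALSE keeps the result a 3 x 1 matrix
y <- m[2:4, "v2", drop = FALSE]
is.matrix(y)                    # TRUE
dim(y)                          # 3 1
```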
On the other hand, a data.frame in R is by definition a special type of list object in which every column object must have the same length, whereas the columns may contain different types of variables (numerical, string etc.). Matrices, in general, can contain only a single type of variable. Thus,

    df[5000:7000, 80]

extracts the vector holding the 80th column and then the values at positions 5000-7000 out of that one. A vector is far simpler for R to handle than a matrix object and is therefore far quicker.
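A minimal sketch of the "list of columns" view, assuming a hypothetical two-column data frame `d` (not the benchmark object):

```r
# Hypothetical small data frame, used only for illustration
d <- data.frame(v1 = 1:6, v2 = letters[1:6], stringsAsFactors = FALSE)

is.list(d)                      # TRUE: a data.frame is a list of columns
length(d)                       # 2: one list element per column

# Selecting a single column returns that list element as a plain vector
col <- d[, "v1"]
identical(col, d$v1)            # TRUE
identical(col, d[["v1"]])       # TRUE

# Selecting rows and one column likewise yields a plain vector
d[2:4, "v1"]                    # 2 3 4
```

Columns of different types can sit side by side here (integer and character), which a matrix would not allow without coercing everything to one type.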
If you choose drop = FALSE, however, you force R not to work with a simple vector object when selecting the 80th column, but to treat the result as a data.frame/list object instead. Lists are the most general and flexible type of R object, as there are no restrictions on their size or entries, but this comes at the price that they are difficult and time-consuming to handle, as you can observe when comparing

    mtx[5000:7000, , drop = FALSE][, "v80"]
    df[5000:7000, , drop = FALSE][, "v80"]

From the data frame you obtain a data.frame/list in the intermediate step, whereas the matrix still returns a matrix, which is still faster to handle than a list.
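The difference in the intermediate objects can be inspected directly. The sketch below uses hypothetical, much smaller objects (`m`, `d`, `sub_m`, `sub_d`) than the benchmark and makes no claim about timings:

```r
# Hypothetical small objects, used only for illustration
m <- matrix(runif(600), ncol = 3, dimnames = list(NULL, paste0("v", 1:3)))
d <- as.data.frame(m)

# Row subset that keeps all columns, as in the slow benchmark variants
sub_m <- m[50:70, , drop = FALSE]
sub_d <- d[50:70, , drop = FALSE]

is.matrix(sub_m)                # TRUE: still one atomic vector with dims
is.data.frame(sub_d)            # TRUE
is.list(sub_d)                  # TRUE: a list with one element per column

# Both routes end in the same plain vector of values
identical(sub_m[, "v2"], sub_d[, "v2"])
```

If only one column is needed, selecting rows and column in a single call (for example `m[50:70, "v2"]`) avoids building the intermediate object at all.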