R - Time difference between different subsetting methods for data.frame and matrix objects


Consider the following benchmark (R 3.4.1 on a Windows machine):

    library(rbenchmark)

    mtx <- matrix(runif(1e8), ncol = 100)
    df <- as.data.frame(mtx)

    colnames(mtx) <- colnames(df) <- paste0("v", 1:100)

    benchmark(
      mtx[5000:7000, 80],
      mtx[5000:7000, "v80"],
      mtx[, "v80"][5000:7000],
      mtx[, "v80", drop = FALSE][5000:7000, ],
      mtx[5000:7000, , drop = FALSE][, "v80"],
      #mtx$v80[5000:7000], # not applicable to a matrix
      replications = 5000
    )

    ##                                      test replications elapsed relative user.self sys.self user.child sys.child
    ## 4 mtx[, "v80", drop = FALSE][5000:7000, ]         5000   64.71  588.273     47.44    16.61         NA        NA
    ## 3                 mtx[, "v80"][5000:7000]         5000   72.15  655.909     52.90    18.18         NA        NA
    ## 2                   mtx[5000:7000, "v80"]         5000    0.11    1.000      0.11     0.00         NA        NA
    ## 5 mtx[5000:7000, , drop = FALSE][, "v80"]         5000    7.47   67.909      5.89     1.47         NA        NA
    ## 1                      mtx[5000:7000, 80]         5000    0.13    1.182      0.12     0.00         NA        NA

    benchmark(
      df[5000:7000, 80],
      df[5000:7000, "v80"],
      df[, "v80"][5000:7000],
      df[, "v80", drop = FALSE][5000:7000, ],
      df[5000:7000, , drop = FALSE][, "v80"],
      df$v80[5000:7000],
      replications = 5000
    )

    ##                                     test replications elapsed relative user.self sys.self user.child sys.child
    ## 6                      df$v80[5000:7000]         5000    0.13    1.000      0.12     0.00         NA        NA
    ## 4 df[, "v80", drop = FALSE][5000:7000, ]         5000    0.33    2.538      0.33     0.00         NA        NA
    ## 3                 df[, "v80"][5000:7000]         5000    0.17    1.308      0.17     0.00         NA        NA
    ## 2                   df[5000:7000, "v80"]         5000    0.15    1.154      0.16     0.00         NA        NA
    ## 5 df[5000:7000, , drop = FALSE][, "v80"]         5000   13.63  104.846     12.91     0.39         NA        NA
    ## 1                      df[5000:7000, 80]         5000    0.19    1.462      0.17     0.00         NA        NA
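For readers who want to try this without the rbenchmark package or the roughly 800 MB matrix, a scaled-down sketch using only base R might look like the following. The smaller matrix size, the replication count and the helper time_expr are assumptions made for illustration and are not part of the original benchmark; absolute timings will differ by machine.

```r
# Scaled-down variant of the benchmark (sizes are assumptions, not the original setup).
set.seed(1)
mtx <- matrix(runif(1e6), ncol = 100)   # 10,000 rows x 100 columns
df  <- as.data.frame(mtx)
colnames(mtx) <- colnames(df) <- paste0("v", 1:100)

# Hypothetical helper: elapsed seconds for n evaluations of an expression.
time_expr <- function(expr, n = 200) {
  e   <- substitute(expr)    # capture the expression unevaluated
  env <- parent.frame()      # evaluate it where mtx and df live
  unname(system.time(for (i in seq_len(n)) eval(e, env))["elapsed"])
}

timings <- c(
  mtx_direct       = time_expr(mtx[5000:7000, "v80"]),
  mtx_column_first = time_expr(mtx[, "v80"][5000:7000]),
  df_dollar        = time_expr(df$v80[5000:7000]),
  df_rows_first    = time_expr(df[5000:7000, , drop = FALSE][, "v80"])
)
print(timings)
```

With only 10,000 rows the gap between the matrix variants is much smaller than in the full-size run, because the column that gets copied is 100 times shorter.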

The time difference is pretty dramatic. Why is that? What is the recommended way of subsetting, and why? Given the benchmarks, mtx[i, colname] for a matrix and df$colname[i] (though it doesn't seem to make much of a difference) for a data.frame seem to be the most time-efficient, but are there general reasons why one should prefer one of the approaches?

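The main reason lies in the R data structures behind matrices and data.frames. A matrix is an object with rownumber x columnnumber (mainly numeric) entries (by R's default a matrix is not sparse) and a dimension property. It is stored as one contiguous vector, so the first two commands

    mtx[5000:7000, 80]
    mtx[5000:7000, "v80"]

pick out just the 2001 requested values in a single step and, with the default drop = TRUE, return them as a simple vector, R's default object, rather than as a new matrix. The slow variants mtx[, "v80"][5000:7000] and mtx[, "v80", drop = FALSE][5000:7000, ] instead first copy the entire column of one million values into a new object and only then take the 2001 entries, and that copy is repeated in each of the 5000 replications.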
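What these forms return can be checked on a toy example. The following is a minimal sketch; the small matrix m and its column names are made up for illustration.

```r
# Toy 5 x 4 matrix with named columns (illustrative data only).
m <- matrix(1:20, ncol = 4, dimnames = list(NULL, paste0("v", 1:4)))

v <- m[2:4, "v3"]                 # default drop = TRUE
k <- m[2:4, "v3", drop = FALSE]   # keep the dimension attribute

is.matrix(v)   # FALSE: a plain integer vector
dim(v)         # NULL
v              # 12 13 14

is.matrix(k)   # TRUE: a 3 x 1 matrix
dim(k)         # 3 1

# Selecting the whole column first builds a new 5-element vector,
# which is then subset again; the result is the same values.
identical(m[, "v3"][2:4], v)   # TRUE
```

The two routes give the same values; they differ only in how much data is copied along the way.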

On the other hand, a data.frame in R is by definition a special type of list object in which the length of each column object has to be identical, whereas the columns may contain different types of variables (numerical, string etc.). A matrix can contain only a single type of variable, by default the most general one among its entries. Thus,

    df[5000:7000, 80]

extracts the vector of the 80th column and then takes the values at positions 5000-7000 out of it. A vector is far simpler for R to handle than a matrix object, and because each column of a data.frame is already stored as a separate vector, selecting it is quick.
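The list nature of a data.frame, and the single-type restriction of a matrix, can be seen in a small sketch like this one. The two-column data frame and the object names are assumptions chosen for illustration.

```r
# A data.frame is a list of equal-length column vectors (toy data).
df <- data.frame(v1 = c(1.5, 2.5, 3.5),
                 v2 = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

is.list(df)         # TRUE
typeof(df)          # "list"
length(df)          # 2, i.e. the number of columns
sapply(df, class)   # v1 "numeric", v2 "character"

# Column first, rows second: both routes yield the same plain vector.
col <- df[["v1"]]
identical(df[2:3, "v1"], col[2:3])   # TRUE
identical(df$v1[2:3], col[2:3])      # TRUE

# A matrix has one type for all entries: mixing numbers and strings
# coerces everything to character.
m <- cbind(c(1.5, 2.5), c("a", "b"))
typeof(m)           # "character"
```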

If you choose drop = FALSE, however, you force R not to work with a simple vector object when selecting the 80th column, but to treat the result as a data.frame/list object instead. Lists are the most general and flexible type of R object, as there are no restraints regarding size and entries, but this comes at the price of being more difficult and time-consuming to handle, as you can observe when comparing

    mtx[5000:7000, , drop = FALSE][, "v80"]
    df[5000:7000, , drop = FALSE][, "v80"]

From the data frame you obtain a data.frame/list of all 100 columns, whereas the matrix still returns a matrix, which is still faster to handle than a list (7.47 s versus 13.63 s in the benchmarks above).
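The intermediate objects created by subsetting the rows first can be inspected on a small example. This is only a sketch; the 10 x 4 random matrix and the names m, d, m_rows and d_rows are assumptions for illustration.

```r
# Toy data: the same values as a matrix and as a data.frame.
set.seed(42)
m <- matrix(runif(40), ncol = 4, dimnames = list(NULL, paste0("v", 1:4)))
d <- as.data.frame(m)

# Rows first, keeping all columns.
m_rows <- m[3:6, , drop = FALSE]
d_rows <- d[3:6, , drop = FALSE]

is.matrix(m_rows)       # TRUE: still one contiguous block with a dim attribute
dim(m_rows)             # 4 4
is.data.frame(d_rows)   # TRUE
is.list(d_rows)         # TRUE: a new list with one subset vector per column
rownames(d_rows)        # "3" "4" "5" "6": row names are carried along too

# The column pulled out afterwards holds the same values either way.
all.equal(as.numeric(m_rows[, "v3"]), as.numeric(d_rows[, "v3"]))
```

For the data frame, every column has to be subset separately and the result reassembled with its row names, which is extra work the matrix route does not need.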

