R: Avoiding loops when using multiple conditions that are dependent on each other -

not sure title clear like. below code written out using loops:

for(i in 1:length(varianttable[,1])){   #n stores counts of numbers of population totals of populations contain variant in question   n = 0   #nf.pop stores frequencies   nf.eas = 0   nf.amr = 0   nf.eur = 0   nf.sas = 0   if(varianttable[i,]$eas_maf > 0){     nf.eas = eascount * varianttable[i,]$eas_maf     n = n + eascount   }   if(varianttable[i,]$amr_maf > 0){     nf.amr = amrcount * varianttable[i,]$amr_maf     n = n + amrcount   }   if(varianttable[i,]$eur_maf > 0){     nf.eur = eurcount * varianttable[i,]$eur_maf     n = n + eurcount   }   if(varianttable[i,]$sas_maf > 0){     nf.sas = sascount * varianttable[i,]$sas_maf     n = n + sascount   }   varianttable[i,]$nonafr_n <- n    varianttable[i,]$nonafr_weighted <- (nf.eas + nf.amr + nf.eur + nf.sas)/n }

as can see varianttable[i,]$nonafr_weighted calculated based on conditions across multiple columns (eas_maf, amr_maf, afr_maf, eur_maf, sas_maf).

i aware loops not quickest way in r, particularly considering fact dataset consists of 900000 rows.

i have started working ifelse , apply methods unsure how use them in situation such this. did try create function takes in 1 row , calculates values row, , using apply method hasn't worked not sure input should be.

any advice on how best proceed problem such this?

edit: here dput of data:

> dput(head(varianttable)) structure(list(chrom = c("1", "1", "1", "1", "1", "1"), pos = c(69224l,  69428l, 69486l, 69487l, 69496l, 69521l), id = c("rs568964432",  "rs140739101", "rs548369610", "rs568226429", "rs150690004", "rs553724620" ), ref = c("a", "t", "c", "g", "g", "t"), alt = c("t", "g", "t",  "a", "a", "a"), af = c(0.000399361, 0.0189696, 0.000199681, 0.000399361,  0.000998403, 0.000399361), ac = c(2l, 95l, 1l, 2l, 5l, 2l), = c(5008l,  5008l, 5008l, 5008l, 5008l, 5008l), eas_af = c(0, 0.003, 0.001,  0, 0, 0), amr_af = c(0.0029, 0.036, 0, 0, 0.0014, 0.0029), afr_af = c(0,  0.0015, 0, 0.0015, 0.003, 0), eur_af = c(0, 0.0497, 0, 0, 0,  0), sas_af = c(0, 0.0153, 0, 0, 0, 0), consequence = c("nonsynonymous snv",  "nonsynonymous snv", "synonymous snv", "nonsynonymous snv", "nonsynonymous snv",  "nonsynonymous snv"), gene = c("or4f5", "or4f5", "or4f5", "or4f5",  "or4f5", "or4f5"), refgene_id = c("nm_001005484", "nm_001005484",  "nm_001005484", "nm_001005484", "nm_001005484", "nm_001005484" ), aa_change = c("('d', 'v')", "('f', 'c')", "('n', 'n')", "('a', 't')",  "('g', 's')", "('i', 'n')"), x0.fold_count = c(572l, 572l, 572l,  572l, 572l, 572l), x4.fold_count = c(141l, 141l, 141l, 141l,  141l, 141l), eas_maf = c(0, 0.003, 0.001, 0, 0, 0), amr_maf = c(0.0029,  0.036, 0, 0, 0.0014, 0.0029), afr_maf = c(0, 0.0015, 0, 0.0015,  0.003, 0), eur_maf = c(0, 0.0497, 0, 0, 0, 0), sas_maf = c(0,  0.0153, 0, 0, 0, 0), nonafr_af = c(0.0029, 0.104, 0.001, 0, 0.0014,  0.0029), nonafr_n = c(309227, 1128036, 262551, 0, 309227, 309227 ), nonafr_weighted = c(0.0029, 0.0261704282487438, 0.001, nan,  0.0014, 0.0029)), .names = c("chrom", "pos", "id", "ref", "alt",  "af", "ac", "an", "eas_af", "amr_af", "afr_af", "eur_af", "sas_af",  "consequence", "gene", "refgene_id", "aa_change", "x0.fold_count",  "x4.fold_count", "eas_maf", "amr_maf", "afr_maf", "eur_maf",  "sas_maf", "nonafr_af", "nonafr_n", "nonafr_weighted"), row.names = c(na,  6l), class = "data.frame")

the populations counts (eascount, amrcount etc) have been defined such:

eascount <- length(varianttable$eas_maf[varianttable$eas_maf>0]) amrcount <- length(varianttable$eas_maf[varianttable$amr_maf>0]) afrcount <- length(varianttable$eas_maf[varianttable$afr_maf>0]) eurcount <- length(varianttable$eas_maf[varianttable$eur_maf>0]) sascount <- length(varianttable$eas_maf[varianttable$sas_maf>0])

the output looking calculation varianttable$nonafr_n , varianttable$nonafr_weighted. example below correct calculation:

> varianttable[2,]   chrom   pos          id ref alt        af ac   eas_af amr_af afr_af eur_af sas_af       consequence 2     1 69428 rs140739101   t   g 0.0189696 95 5008  0.003  0.036 0.0015 0.0497 0.0153 nonsynonymous snv    gene   refgene_id  aa_change x0.fold_count x4.fold_count eas_maf amr_maf afr_maf eur_maf sas_maf 2 or4f5 nm_001005484 ('f', 'c')           572           141   0.003   0.036  0.0015  0.0497  0.0153   nonafr_af nonafr_n nonafr_weighted 2     0.104  1128036      0.02617043

would work ?

library(dplyr) varianttable %>% mutate(   nf.eas = eascount * eas_maf,   nf.amr = amrcount * amr_maf,   nf.eur = eurcount * eur_maf,   nf.sas = sascount * sas_maf,   nonafr_n = eascount * (eas_maf>0) + amrcount * (amr_maf>0) + eurcount * (eur_maf>0) + sascount * (sas_maf>0),   nonafr_weighted = (nf.eas + nf.amr + nf.eur + nf.sas)/nonafr_n) %>%   select(-c(nf.eas,nf.amr,nf.eur,nf.sas))

mutate adds or modify columns table, allows use of column names without $ notation. if structure wasn't necessary because multiplying 0 default 0 value anyway, first part of script simplified.

booleans coerced integers 0 , 1 when used arithmetic operators, didn't need if structure compute n either.

and last column quite straightforward.

all these operations vectorized, meaning added, subtracted, multiplied, divided columns directly rather discrete values, it's faster machine , easier on eyes.

also, more efficient , easier read:

eascount <- sum(varianttable$eas_maf>0) amrcount <- sum(varianttable$amr_maf>0) afrcount <- sum(varianttable$afr_maf>0) eurcount <- sum(varianttable$eur_maf>0) sascount <- sum(varianttable$sas_maf>0)

Search This Blog

Single

R: Avoiding loops when using multiple conditions that are dependent on each other -

Comments

Post a Comment

Popular posts from this blog

angular - Ionic slides - dynamically add slides before and after -

minify - Minimizing css files -

Add a dynamic header in angular 2 http provider -