R: Avoiding loops when using multiple conditions that are dependent on each other -
not sure title clear like. below code written out using loops:
for(i in 1:length(varianttable[,1])){ #n stores counts of numbers of population totals of populations contain variant in question n = 0 #nf.pop stores frequencies nf.eas = 0 nf.amr = 0 nf.eur = 0 nf.sas = 0 if(varianttable[i,]$eas_maf > 0){ nf.eas = eascount * varianttable[i,]$eas_maf n = n + eascount } if(varianttable[i,]$amr_maf > 0){ nf.amr = amrcount * varianttable[i,]$amr_maf n = n + amrcount } if(varianttable[i,]$eur_maf > 0){ nf.eur = eurcount * varianttable[i,]$eur_maf n = n + eurcount } if(varianttable[i,]$sas_maf > 0){ nf.sas = sascount * varianttable[i,]$sas_maf n = n + sascount } varianttable[i,]$nonafr_n <- n varianttable[i,]$nonafr_weighted <- (nf.eas + nf.amr + nf.eur + nf.sas)/n }
as can see varianttable[i,]$nonafr_weighted
calculated based on conditions across multiple columns (eas_maf, amr_maf, afr_maf, eur_maf, sas_maf).
i aware loops not quickest way in r, particularly considering fact dataset consists of 900000 rows.
i have started working ifelse , apply methods unsure how use them in situation such this. did try create function takes in 1 row , calculates values row, , using apply method hasn't worked not sure input should be.
any advice on how best proceed problem such this?
edit: here dput of data:
> dput(head(varianttable)) structure(list(chrom = c("1", "1", "1", "1", "1", "1"), pos = c(69224l, 69428l, 69486l, 69487l, 69496l, 69521l), id = c("rs568964432", "rs140739101", "rs548369610", "rs568226429", "rs150690004", "rs553724620" ), ref = c("a", "t", "c", "g", "g", "t"), alt = c("t", "g", "t", "a", "a", "a"), af = c(0.000399361, 0.0189696, 0.000199681, 0.000399361, 0.000998403, 0.000399361), ac = c(2l, 95l, 1l, 2l, 5l, 2l), = c(5008l, 5008l, 5008l, 5008l, 5008l, 5008l), eas_af = c(0, 0.003, 0.001, 0, 0, 0), amr_af = c(0.0029, 0.036, 0, 0, 0.0014, 0.0029), afr_af = c(0, 0.0015, 0, 0.0015, 0.003, 0), eur_af = c(0, 0.0497, 0, 0, 0, 0), sas_af = c(0, 0.0153, 0, 0, 0, 0), consequence = c("nonsynonymous snv", "nonsynonymous snv", "synonymous snv", "nonsynonymous snv", "nonsynonymous snv", "nonsynonymous snv"), gene = c("or4f5", "or4f5", "or4f5", "or4f5", "or4f5", "or4f5"), refgene_id = c("nm_001005484", "nm_001005484", "nm_001005484", "nm_001005484", "nm_001005484", "nm_001005484" ), aa_change = c("('d', 'v')", "('f', 'c')", "('n', 'n')", "('a', 't')", "('g', 's')", "('i', 'n')"), x0.fold_count = c(572l, 572l, 572l, 572l, 572l, 572l), x4.fold_count = c(141l, 141l, 141l, 141l, 141l, 141l), eas_maf = c(0, 0.003, 0.001, 0, 0, 0), amr_maf = c(0.0029, 0.036, 0, 0, 0.0014, 0.0029), afr_maf = c(0, 0.0015, 0, 0.0015, 0.003, 0), eur_maf = c(0, 0.0497, 0, 0, 0, 0), sas_maf = c(0, 0.0153, 0, 0, 0, 0), nonafr_af = c(0.0029, 0.104, 0.001, 0, 0.0014, 0.0029), nonafr_n = c(309227, 1128036, 262551, 0, 309227, 309227 ), nonafr_weighted = c(0.0029, 0.0261704282487438, 0.001, nan, 0.0014, 0.0029)), .names = c("chrom", "pos", "id", "ref", "alt", "af", "ac", "an", "eas_af", "amr_af", "afr_af", "eur_af", "sas_af", "consequence", "gene", "refgene_id", "aa_change", "x0.fold_count", "x4.fold_count", "eas_maf", "amr_maf", "afr_maf", "eur_maf", "sas_maf", "nonafr_af", "nonafr_n", "nonafr_weighted"), row.names = c(na, 6l), class = "data.frame")
the populations counts (eascount, amrcount etc) have been defined such:
eascount <- length(varianttable$eas_maf[varianttable$eas_maf>0]) amrcount <- length(varianttable$eas_maf[varianttable$amr_maf>0]) afrcount <- length(varianttable$eas_maf[varianttable$afr_maf>0]) eurcount <- length(varianttable$eas_maf[varianttable$eur_maf>0]) sascount <- length(varianttable$eas_maf[varianttable$sas_maf>0])
the output looking calculation varianttable$nonafr_n , varianttable$nonafr_weighted. example below correct calculation:
> varianttable[2,] chrom pos id ref alt af ac eas_af amr_af afr_af eur_af sas_af consequence 2 1 69428 rs140739101 t g 0.0189696 95 5008 0.003 0.036 0.0015 0.0497 0.0153 nonsynonymous snv gene refgene_id aa_change x0.fold_count x4.fold_count eas_maf amr_maf afr_maf eur_maf sas_maf 2 or4f5 nm_001005484 ('f', 'c') 572 141 0.003 0.036 0.0015 0.0497 0.0153 nonafr_af nonafr_n nonafr_weighted 2 0.104 1128036 0.02617043
would work ?
library(dplyr) varianttable %>% mutate( nf.eas = eascount * eas_maf, nf.amr = amrcount * amr_maf, nf.eur = eurcount * eur_maf, nf.sas = sascount * sas_maf, nonafr_n = eascount * (eas_maf>0) + amrcount * (amr_maf>0) + eurcount * (eur_maf>0) + sascount * (sas_maf>0), nonafr_weighted = (nf.eas + nf.amr + nf.eur + nf.sas)/nonafr_n) %>% select(-c(nf.eas,nf.amr,nf.eur,nf.sas))
mutate adds or modify columns table, allows use of column names without $
notation. if
structure wasn't necessary because multiplying 0 default 0 value anyway, first part of script simplified.
booleans coerced integers 0
, 1
when used arithmetic operators, didn't need if
structure compute n
either.
and last column quite straightforward.
all these operations vectorized, meaning added, subtracted, multiplied, divided columns directly rather discrete values, it's faster machine , easier on eyes.
also, more efficient , easier read:
eascount <- sum(varianttable$eas_maf>0) amrcount <- sum(varianttable$amr_maf>0) afrcount <- sum(varianttable$afr_maf>0) eurcount <- sum(varianttable$eur_maf>0) sascount <- sum(varianttable$sas_maf>0)
Comments
Post a Comment