r - How to address the value in the next row -


i have data set of equity returns thousand companies past 30 years. of these companies "dead" (usually delisted or bankrupt), , therefore have returns = 0. want assign nas returns of these companies, after time "died". in order so, have tried use following code:

if(returns$r == 0 & stri_detect_fixed(returns$company, "dead"), na.rm = true){   returns$r[returns$r == 0 & stri_detect_fixed(returns$company, "dead")] <- na } 

this works pretty well, unfortunately there on occasion return values equal 0 dead/delisted companies before "died", , these values want remain @ 0.

therefore, need command/if condition telling r want return nas if return in next row equal 0 well. guys have suggestions? hope made problem clear, though know explanation may bit confusing.

reproducible example

returns <- structure(list(date = c("04.09.17", "05.09.17", "06.09.17", "01.09.17",  "02.09.17", "03.09.17", "04.09.17", "05.09.17", "06.09.17", "04.09.17",  "05.09.17", "06.09.17"), company = c("orkla", "orkla", "orkla",  "visma dead 04.09.17", "visma dead 04.09.17", "visma dead 04.09.17",  "visma dead 04.09.17", "visma dead 04.09.17", "visma dead 04.09.17",  "xnewco", "xnewco", "xnewco"), r = c(0.04, 0, -0.02, 0.01, 0, -0.03,  0, 0, 0, 0.01, 0, 0)), .names = c("date", "company", "r"), row.names = c(na,  -12l), class = "data.frame") 

(edited cover case "live" company has 0 returns @ end of time series)

my data frame returns looks this:

date       company                r 04.09.17   orkla                  0.04 05.09.17   orkla                  0.00 06.09.17   orkla                  -0.02 01.09.17   visma dead 04.09.17    0.01 02.09.17   visma dead 04.09.17    0.00 03.09.17   visma dead 04.09.17    -0.03 04.09.17   visma dead 04.09.17    0.00 05.09.17   visma dead 04.09.17    0.00 06.09.17   visma dead 04.09.17    0.00  04.09.17   xnewco                 0.01 05.09.17   xnewco                 0.00 06.09.17   xnewco                 0.00 

i want this:

date       company                r 04.09.17   orkla                  0.04 05.09.17   orkla                  0.00 06.09.17   orkla                  -0.02 01.09.17   visma dead 04.09.17    0.01 02.09.17   visma dead 04.09.17    0.00 03.09.17   visma dead 04.09.17    -0.03 04.09.17   visma dead 04.09.17    na 05.09.17   visma dead 04.09.17    na 06.09.17   visma dead 04.09.17    na 04.09.17   xnewco                 0.01 05.09.17   xnewco                 0.00 06.09.17   xnewco                 0.00 

my current code (as you'll see above) wouldn't work, replace return of 0.00 visma 02.09.17 na. need remain 0.00, before visma "died"

edit: while preparing benchmark, noticed condition missing prevent replacement of 0 values @ end of time series companies alive. unfortunately, case not covered original sample data provided op (before editing) went undiscovered. have amended below solutions accordingly.


according op's words i have data set of equity returns thousand companies past 30 years, data set may contain several millions rows (a conservative estimate: 250 business days per year * 2000 companies * 5 years of average existence = 2.5 m rows))

thus, need replace few values without copying whole data set. data.table allows update data in place.

the op has requested find consecutive sequences of zeros @ end of time series of each company , replace these zeros na.

with data.table, there 2 options here:

using rleid() function

library(data.table) # coerce data.table setdt(returns) # convert character dates returns[, date := as.idate(date, "%d.%m.%y")][] # make sure data ordered setorder(returns, company, date)[]  returns[, company %like% "dead" & r == 0 & rleid(r == 0) == max(rleid(r == 0)),          = company] 

for each dead company, last sequence of 0 values picked:

                company    v1  1:               orkla false  2:               orkla false  3:               orkla false  4: visma dead 04.09.17 false  5: visma dead 04.09.17 false  6: visma dead 04.09.17 false  7: visma dead 04.09.17  true  8: visma dead 04.09.17  true  9: visma dead 04.09.17  true 10:              xnewco false 11:              xnewco false 12:              xnewco false 

the v1 column used subset , update dt in place:

returns[returns[,  company %like% "dead" & r == 0 & rleid(r == 0) == max(rleid(r == 0)),                  = company]$v1, r := na_real_][] 
          date             company     r  1: 2017-09-04               orkla  0.04  2: 2017-09-05               orkla  0.00  3: 2017-09-06               orkla -0.02  4: 2017-09-01 visma dead 04.09.17  0.01  5: 2017-09-02 visma dead 04.09.17  0.00  6: 2017-09-03 visma dead 04.09.17 -0.03  7: 2017-09-04 visma dead 04.09.17    na  8: 2017-09-05 visma dead 04.09.17    na  9: 2017-09-06 visma dead 04.09.17    na 10: 2017-09-04              xnewco  0.01 11: 2017-09-05              xnewco  0.00 12: 2017-09-06              xnewco  0.00 

finding index of last non-zero value

returns[, {tmp <- last(which(r != 0))             if (company %like% "dead" & tmp < .n) .i[seq.int(tmp + 1l, .n)]}, = company] 

here, position of last non-zero value of each time series picked used construct indices remaining 0 values dead companies. .i , .n special symbols in data.table syntax. check if (company %like% "dead" & tmp < .n) required in case there no 0 values @ end of time series of dead company.

               company v1 1: visma dead 04.09.17  7 2: visma dead 04.09.17  8 3: visma dead 04.09.17  9 

as above, v1 used subset , update returns in place:

returns[returns[, {tmp <- last(which(r != 0))                    if (company %like% "dead" & tmp < .n) .i[seq.int(tmp + 1l, .n)]},                  = company]$v1, r := na_real_][] 
          date             company     r  1: 2017-09-04               orkla  0.04  2: 2017-09-05               orkla  0.00  3: 2017-09-06               orkla -0.02  4: 2017-09-01 visma dead 04.09.17  0.01  5: 2017-09-02 visma dead 04.09.17  0.00  6: 2017-09-03 visma dead 04.09.17 -0.03  7: 2017-09-04 visma dead 04.09.17    na  8: 2017-09-05 visma dead 04.09.17    na  9: 2017-09-06 visma dead 04.09.17    na 10: 2017-09-04              xnewco  0.01 11: 2017-09-05              xnewco  0.00 12: 2017-09-06              xnewco  0.00 

benchmark

hack-r claimed his solutions should perform perhaps 1m rows. so, wanted verify claim benchmark.

creating benchmark data

library(data.table)  # create benchmark data n_days <- 100l n_comp <- 100l n_dead <- round(0.1 * n_comp) # 10 per cent of companies dead date <- seq(from = as.idate("2015-01-01"), length.out = n_days, = "1 day") # company "names" consist of 4 digits @ least company <- sprintf("%04i", seq_len(n_comp))   # cross join create combinations returns <- cj(date = date, company = company)  set.seed(1l) # reuired reproducible result returns[, r := round(rnorm(.n)/10.0, 2l)][]  # dead companies dead <- data.table(company = sample(company, n_dead),                    dead.date = sample(date, n_dead)) # modify returns returns[dead, on = .(company, date >= dead.date), r := 0] # modify compay names returns[dead, on = "company", company := paste(company, "dead", dead.date)]  # important: set order setorder(returns, company, date) # keep original version r0 <- copy(returns) 

benchmark code

microbenchmark::microbenchmark(   copy = returns <- copy(r0),   hackr1 = {     mydat <- setdf(copy(r0))     for(i in 1:nrow(mydat)){       if(i==nrow(mydat) & mydat$r[i]==0) {mydat$r[i] <- na       } else if(!is.na(mydat$r[i]) & mydat$r[i]==0 & mydat$r[i+1]==0) mydat$r[i] <- na     }     res_hackr1 <- mydat   },   hackr2 = {     mydat <- copy(r0)     tmp0 <- mydat[0,]     for(c in unique(mydat$company)){       tmp <- mydat[mydat$company==c,]       for(i in 1:nrow(tmp)){         if(!is.na(tmp$r[i]) & tmp$r[i]==0 & tmp$r[i+1]==0){           tmp$r[i:nrow(tmp)] <- na         }       }       tmp0 <- rbind(tmp0, tmp)     }     res_hackr2 <- tmp0   },   dt_rleid1 = {     returns <- copy(r0)     returns[returns[,  company %like% "dead" & r == 0 & rleid(r == 0) == max(rleid(r == 0)),                      = company]$v1, r := na_real_]     res_dt_rleid1 <- copy(returns)     },   dt_rleid2 = {     returns <- copy(r0)     returns[company %like% "dead" & returns[,  r == 0 & rleid(r == 0) == max(rleid(r == 0)),                      = company]$v1, r := na_real_]     res_dt_rleid2 <- copy(returns)   },   dt_last = {     returns <- copy(r0)     returns[returns[, {       tmp <- last(which(r != 0))       if (company %like% "dead") .i[tmp + seq_len(.n - tmp)]     },      = company]$v1, r := na_real_]     res_dt_last <- copy(returns)   },   dt_last2 = {     returns <- copy(r0)     returns[returns[, {       tmp <- last(which(r != 0))       if (company %like% "dead" & tmp < .n) .i[seq.int(tmp + 1l, .n)]     },      = company]$v1, r := na_real_]     res_dt_last2 <- copy(returns)   },   times = 11l ) 

as codes modify data set in place, copy() used create "fresh" unmodified data set before each run , store result later comparison. therefore, copy() timed well.

dt_rleid1 , dt_rleid2 dt_last , dt_last2 code variations of respective solutions.

benchmark results

unfortunately, hackr2 stopped execution error message:

error in if (!is.na(tmp$r[i]) & tmp$r[i] == 0 & tmp$r[i + 1] == 0) { :
missing value true/false needed

the timings remaining solutions are:

unit: microseconds       expr        min         lq         mean     median          uq        max neval cld       copy     46.065     48.331     53.75427     52.485     58.1475     66.077    11       hackr1 267515.143 269559.179 277240.15827 271093.857 275196.8435 329919.874    11   b  dt_rleid1   2203.942   2404.060   3130.73218   2690.267   3728.9925   4813.783    11    dt_rleid2   2577.370   2665.346   5750.63073   2700.839   2741.0510  36395.429    11      dt_last   1605.098   1627.564   1718.85318   1654.561   1724.6030   2036.296    11     dt_last2   1665.134   1718.372   1945.67645   1764.438   1769.5350   3909.476    11  

the data.table solutions 2 magnitudes faster hack-r's approach rather small problem size of 100 x 100 = 10 k rows. tried run hack-r's solution 1000 x 1000 = 1 m rows had not patience wait result.

for 1 m rows, approach finding last non-zero value 5 6 times faster the rleid() approach.

unit: milliseconds       expr        min         lq      mean     median        uq      max neval cld       copy   6.602008   6.843094  21.23383   7.297889  13.61614 141.5794    11    dt_rleid1  63.282609  70.239165 142.21568 193.972143 199.32077 224.5657    11  b   dt_rleid2 157.939571 281.185658 266.62148 288.184692 291.61445 309.5796    11   c    dt_last  35.826792  39.198781 101.66298  48.387030 172.40187 182.2354    11  b    dt_last2  36.507194  43.754676 103.95414  48.879018 173.66035 183.1639    11  b 

Comments

Popular posts from this blog

neo4j - finding mutual friends in a cypher statement starting with three or more persons -

php - How to remove letter in front of the word laravel -

minify - Minimizing css files -