r - How to address the value in the next row -
i have data set of equity returns thousand companies past 30 years. of these companies "dead" (usually delisted or bankrupt), , therefore have returns = 0. want assign nas returns of these companies, after time "died". in order so, have tried use following code:
if(returns$r == 0 & stri_detect_fixed(returns$company, "dead"), na.rm = true){ returns$r[returns$r == 0 & stri_detect_fixed(returns$company, "dead")] <- na } this works pretty well, unfortunately there on occasion return values equal 0 dead/delisted companies before "died", , these values want remain @ 0.
therefore, need command/if condition telling r want return nas if return in next row equal 0 well. guys have suggestions? hope made problem clear, though know explanation may bit confusing.
reproducible example
returns <- structure(list(date = c("04.09.17", "05.09.17", "06.09.17", "01.09.17", "02.09.17", "03.09.17", "04.09.17", "05.09.17", "06.09.17", "04.09.17", "05.09.17", "06.09.17"), company = c("orkla", "orkla", "orkla", "visma dead 04.09.17", "visma dead 04.09.17", "visma dead 04.09.17", "visma dead 04.09.17", "visma dead 04.09.17", "visma dead 04.09.17", "xnewco", "xnewco", "xnewco"), r = c(0.04, 0, -0.02, 0.01, 0, -0.03, 0, 0, 0, 0.01, 0, 0)), .names = c("date", "company", "r"), row.names = c(na, -12l), class = "data.frame") (edited cover case "live" company has 0 returns @ end of time series)
my data frame returns looks this:
date company r 04.09.17 orkla 0.04 05.09.17 orkla 0.00 06.09.17 orkla -0.02 01.09.17 visma dead 04.09.17 0.01 02.09.17 visma dead 04.09.17 0.00 03.09.17 visma dead 04.09.17 -0.03 04.09.17 visma dead 04.09.17 0.00 05.09.17 visma dead 04.09.17 0.00 06.09.17 visma dead 04.09.17 0.00 04.09.17 xnewco 0.01 05.09.17 xnewco 0.00 06.09.17 xnewco 0.00 i want this:
date company r 04.09.17 orkla 0.04 05.09.17 orkla 0.00 06.09.17 orkla -0.02 01.09.17 visma dead 04.09.17 0.01 02.09.17 visma dead 04.09.17 0.00 03.09.17 visma dead 04.09.17 -0.03 04.09.17 visma dead 04.09.17 na 05.09.17 visma dead 04.09.17 na 06.09.17 visma dead 04.09.17 na 04.09.17 xnewco 0.01 05.09.17 xnewco 0.00 06.09.17 xnewco 0.00 my current code (as you'll see above) wouldn't work, replace return of 0.00 visma 02.09.17 na. need remain 0.00, before visma "died"
edit: while preparing benchmark, noticed condition missing prevent replacement of 0 values @ end of time series companies alive. unfortunately, case not covered original sample data provided op (before editing) went undiscovered. have amended below solutions accordingly.
according op's words i have data set of equity returns thousand companies past 30 years, data set may contain several millions rows (a conservative estimate: 250 business days per year * 2000 companies * 5 years of average existence = 2.5 m rows))
thus, need replace few values without copying whole data set. data.table allows update data in place.
the op has requested find consecutive sequences of zeros @ end of time series of each company , replace these zeros na.
with data.table, there 2 options here:
using rleid() function
library(data.table) # coerce data.table setdt(returns) # convert character dates returns[, date := as.idate(date, "%d.%m.%y")][] # make sure data ordered setorder(returns, company, date)[] returns[, company %like% "dead" & r == 0 & rleid(r == 0) == max(rleid(r == 0)), = company] for each dead company, last sequence of 0 values picked:
company v1 1: orkla false 2: orkla false 3: orkla false 4: visma dead 04.09.17 false 5: visma dead 04.09.17 false 6: visma dead 04.09.17 false 7: visma dead 04.09.17 true 8: visma dead 04.09.17 true 9: visma dead 04.09.17 true 10: xnewco false 11: xnewco false 12: xnewco false
the v1 column used subset , update dt in place:
returns[returns[, company %like% "dead" & r == 0 & rleid(r == 0) == max(rleid(r == 0)), = company]$v1, r := na_real_][] date company r 1: 2017-09-04 orkla 0.04 2: 2017-09-05 orkla 0.00 3: 2017-09-06 orkla -0.02 4: 2017-09-01 visma dead 04.09.17 0.01 5: 2017-09-02 visma dead 04.09.17 0.00 6: 2017-09-03 visma dead 04.09.17 -0.03 7: 2017-09-04 visma dead 04.09.17 na 8: 2017-09-05 visma dead 04.09.17 na 9: 2017-09-06 visma dead 04.09.17 na 10: 2017-09-04 xnewco 0.01 11: 2017-09-05 xnewco 0.00 12: 2017-09-06 xnewco 0.00
finding index of last non-zero value
returns[, {tmp <- last(which(r != 0)) if (company %like% "dead" & tmp < .n) .i[seq.int(tmp + 1l, .n)]}, = company] here, position of last non-zero value of each time series picked used construct indices remaining 0 values dead companies. .i , .n special symbols in data.table syntax. check if (company %like% "dead" & tmp < .n) required in case there no 0 values @ end of time series of dead company.
company v1 1: visma dead 04.09.17 7 2: visma dead 04.09.17 8 3: visma dead 04.09.17 9
as above, v1 used subset , update returns in place:
returns[returns[, {tmp <- last(which(r != 0)) if (company %like% "dead" & tmp < .n) .i[seq.int(tmp + 1l, .n)]}, = company]$v1, r := na_real_][] date company r 1: 2017-09-04 orkla 0.04 2: 2017-09-05 orkla 0.00 3: 2017-09-06 orkla -0.02 4: 2017-09-01 visma dead 04.09.17 0.01 5: 2017-09-02 visma dead 04.09.17 0.00 6: 2017-09-03 visma dead 04.09.17 -0.03 7: 2017-09-04 visma dead 04.09.17 na 8: 2017-09-05 visma dead 04.09.17 na 9: 2017-09-06 visma dead 04.09.17 na 10: 2017-09-04 xnewco 0.01 11: 2017-09-05 xnewco 0.00 12: 2017-09-06 xnewco 0.00
benchmark
hack-r claimed his solutions should perform perhaps 1m rows. so, wanted verify claim benchmark.
creating benchmark data
library(data.table) # create benchmark data n_days <- 100l n_comp <- 100l n_dead <- round(0.1 * n_comp) # 10 per cent of companies dead date <- seq(from = as.idate("2015-01-01"), length.out = n_days, = "1 day") # company "names" consist of 4 digits @ least company <- sprintf("%04i", seq_len(n_comp)) # cross join create combinations returns <- cj(date = date, company = company) set.seed(1l) # reuired reproducible result returns[, r := round(rnorm(.n)/10.0, 2l)][] # dead companies dead <- data.table(company = sample(company, n_dead), dead.date = sample(date, n_dead)) # modify returns returns[dead, on = .(company, date >= dead.date), r := 0] # modify compay names returns[dead, on = "company", company := paste(company, "dead", dead.date)] # important: set order setorder(returns, company, date) # keep original version r0 <- copy(returns) benchmark code
microbenchmark::microbenchmark( copy = returns <- copy(r0), hackr1 = { mydat <- setdf(copy(r0)) for(i in 1:nrow(mydat)){ if(i==nrow(mydat) & mydat$r[i]==0) {mydat$r[i] <- na } else if(!is.na(mydat$r[i]) & mydat$r[i]==0 & mydat$r[i+1]==0) mydat$r[i] <- na } res_hackr1 <- mydat }, hackr2 = { mydat <- copy(r0) tmp0 <- mydat[0,] for(c in unique(mydat$company)){ tmp <- mydat[mydat$company==c,] for(i in 1:nrow(tmp)){ if(!is.na(tmp$r[i]) & tmp$r[i]==0 & tmp$r[i+1]==0){ tmp$r[i:nrow(tmp)] <- na } } tmp0 <- rbind(tmp0, tmp) } res_hackr2 <- tmp0 }, dt_rleid1 = { returns <- copy(r0) returns[returns[, company %like% "dead" & r == 0 & rleid(r == 0) == max(rleid(r == 0)), = company]$v1, r := na_real_] res_dt_rleid1 <- copy(returns) }, dt_rleid2 = { returns <- copy(r0) returns[company %like% "dead" & returns[, r == 0 & rleid(r == 0) == max(rleid(r == 0)), = company]$v1, r := na_real_] res_dt_rleid2 <- copy(returns) }, dt_last = { returns <- copy(r0) returns[returns[, { tmp <- last(which(r != 0)) if (company %like% "dead") .i[tmp + seq_len(.n - tmp)] }, = company]$v1, r := na_real_] res_dt_last <- copy(returns) }, dt_last2 = { returns <- copy(r0) returns[returns[, { tmp <- last(which(r != 0)) if (company %like% "dead" & tmp < .n) .i[seq.int(tmp + 1l, .n)] }, = company]$v1, r := na_real_] res_dt_last2 <- copy(returns) }, times = 11l ) as codes modify data set in place, copy() used create "fresh" unmodified data set before each run , store result later comparison. therefore, copy() timed well.
dt_rleid1 , dt_rleid2 dt_last , dt_last2 code variations of respective solutions.
benchmark results
unfortunately, hackr2 stopped execution error message:
error in if (!is.na(tmp$r[i]) & tmp$r[i] == 0 & tmp$r[i + 1] == 0) { :
missing value true/false needed
the timings remaining solutions are:
unit: microseconds expr min lq mean median uq max neval cld copy 46.065 48.331 53.75427 52.485 58.1475 66.077 11 hackr1 267515.143 269559.179 277240.15827 271093.857 275196.8435 329919.874 11 b dt_rleid1 2203.942 2404.060 3130.73218 2690.267 3728.9925 4813.783 11 dt_rleid2 2577.370 2665.346 5750.63073 2700.839 2741.0510 36395.429 11 dt_last 1605.098 1627.564 1718.85318 1654.561 1724.6030 2036.296 11 dt_last2 1665.134 1718.372 1945.67645 1764.438 1769.5350 3909.476 11
the data.table solutions 2 magnitudes faster hack-r's approach rather small problem size of 100 x 100 = 10 k rows. tried run hack-r's solution 1000 x 1000 = 1 m rows had not patience wait result.
for 1 m rows, approach finding last non-zero value 5 6 times faster the rleid() approach.
unit: milliseconds expr min lq mean median uq max neval cld copy 6.602008 6.843094 21.23383 7.297889 13.61614 141.5794 11 dt_rleid1 63.282609 70.239165 142.21568 193.972143 199.32077 224.5657 11 b dt_rleid2 157.939571 281.185658 266.62148 288.184692 291.61445 309.5796 11 c dt_last 35.826792 39.198781 101.66298 48.387030 172.40187 182.2354 11 b dt_last2 36.507194 43.754676 103.95414 48.879018 173.66035 183.1639 11 b
Comments
Post a Comment