python - Multiple variable derivations in pandas group-by object (time series) -
i have dataframe according spec. below. actual size around 100 million rows, , 150 columns:
df = pd.dataframe({'id' : [1,1,1,1,3,3,3,3] ,'time_ref' : ['20150401','20150401','20150401','20150401','20150401','20150401','20150401','20150401'] ,'time' : ['20150101', '20150201','20150301','20150401', '20150101',' 20150201','20150301','20150401'] ,'disc' : [3,3,1,1,2,4,5,7]} ,columns =['id', 'time_ref','time', 'disc' ] ) time in ['time_ref','time']: df[time] = pd.to_datetime(df[time] ,format = '%y%m%d' ,errors = 'ignore') df
i have solved problem, appearance below:
df2 = pd.dataframe({'id' : [1,1,1,1,3,3,3,3] ,'time_ref' : ['20150401','20150401','20150401','20150401','20150401','20150401','20150401','20150401'] ,'time' : ['20150101', '20150201','20150301','20150401', '20150101',' 20150201','20150301','20150401'] ,'disc' : [3,3,1,1,2,4,5,7] ,'disc_agg_diff' : [-2,-2,-2,-2,2,2,2,2] ,'disc_agg_time_diff' : [2, 2,2 ,2,1,1,1,1]} ,columns =['id', 'time_ref','time', 'disc', 'disc_agg_diff','disc_agg_time_diff'] ) time in ['time_ref','time']: df2[time] = pd.to_datetime(df2[time] ,format = '%y%m%d' ,errors = 'ignore') df2
column
disc_agg_diff
looks difference between first change in variabledisc
, i.e. (1-3) = -2 @ time point20150301
id` = 1. (7-2) = 5 @ time point 20150401 id 3.column
disc_agg_time_diff
looks @ how many time periods (months) change occurred. id 1, 2 periods. id 3, 1 period.
i getting result lot of groupbys, left joins , numpy.where working way through each ids , monthly data points. problem code big, , if add more variables same derivations becomes bigger , bit messy.
my questions is: can result obtained in clean , efficient manner, code-vise, derivations done (subtractions) can quotes, adding columns, etc. also, speed crucial since data volume big.
many in advance time , considerations!
/swepab
you can use transform
. found rather bog post explaining on pbpython.com
something first column
disc_agg_diff
def get_disc_agg_diff(disc_column): return disc_column.diff().replace(0, none).dropna().iloc[0]
depending on data, using for-loop might more efficient .replace
, .dropna
, iloc
df.groupby('id')['disc'].transform(get_first_diff)
returns
0 -2 1 -2 2 -2 3 -2 4 2 5 2 6 2 7 2
disc_agg_time_diff
works approximately same `get_first_diff
def get_disc_agg_time_diff(disc_column): diff = disc_column.diff().reset_index(drop=true) diff_reduced = diff.replace(0, none).dropna() return diff_reduced.index[0]
here well, alternative for-loop, works.
Comments
Post a Comment