python - Multiple variable derivations in pandas group-by object (time series) -

August 15, 2011

i have dataframe according spec. below. actual size around 100 million rows, , 150 columns:

df = pd.dataframe({'id' : [1,1,1,1,3,3,3,3]                ,'time_ref' : ['20150401','20150401','20150401','20150401','20150401','20150401','20150401','20150401']                ,'time' : ['20150101', '20150201','20150301','20150401', '20150101',' 20150201','20150301','20150401']                ,'disc' : [3,3,1,1,2,4,5,7]}               ,columns =['id', 'time_ref','time', 'disc' ]              )   time in ['time_ref','time']:     df[time] = pd.to_datetime(df[time]                           ,format = '%y%m%d'                           ,errors = 'ignore')   df

i have solved problem, appearance below:

df2 = pd.dataframe({'id' : [1,1,1,1,3,3,3,3]                ,'time_ref' : ['20150401','20150401','20150401','20150401','20150401','20150401','20150401','20150401']                ,'time' : ['20150101', '20150201','20150301','20150401', '20150101',' 20150201','20150301','20150401']                ,'disc' : [3,3,1,1,2,4,5,7]                ,'disc_agg_diff' : [-2,-2,-2,-2,2,2,2,2]                ,'disc_agg_time_diff' : [2, 2,2 ,2,1,1,1,1]}              ,columns =['id', 'time_ref','time', 'disc', 'disc_agg_diff','disc_agg_time_diff']              )    time in ['time_ref','time']:     df2[time] = pd.to_datetime(df2[time]                           ,format = '%y%m%d'                           ,errors = 'ignore')   df2

column disc_agg_diff looks difference between first change in variable disc, i.e. (1-3) = -2 @ time point 20150301 id` = 1. (7-2) = 5 @ time point 20150401 id 3.
column disc_agg_time_diff looks @ how many time periods (months) change occurred. id 1, 2 periods. id 3, 1 period.

i getting result lot of groupbys, left joins , numpy.where working way through each ids , monthly data points. problem code big, , if add more variables same derivations becomes bigger , bit messy.

my questions is: can result obtained in clean , efficient manner, code-vise, derivations done (subtractions) can quotes, adding columns, etc. also, speed crucial since data volume big.

many in advance time , considerations!

/swepab

you can use transform. found rather bog post explaining on pbpython.com

something first column

`disc_agg_diff`

def get_disc_agg_diff(disc_column):     return disc_column.diff().replace(0, none).dropna().iloc[0]

depending on data, using for-loop might more efficient .replace, .dropna , iloc

df.groupby('id')['disc'].transform(get_first_diff)

returns

0   -2 1   -2 2   -2 3   -2 4    2 5    2 6    2 7    2

`disc_agg_time_diff`

works approximately same `get_first_diff

def get_disc_agg_time_diff(disc_column):     diff = disc_column.diff().reset_index(drop=true)     diff_reduced = diff.replace(0, none).dropna()     return diff_reduced.index[0]

here well, alternative for-loop, works.

Search This Blog

Single

python - Multiple variable derivations in pandas group-by object (time series) -

`disc_agg_diff`

`disc_agg_time_diff`

Comments

Post a Comment

Popular posts from this blog

angular - Ionic slides - dynamically add slides before and after -

minify - Minimizing css files -

Add a dynamic header in angular 2 http provider -