amazon web services - How to speed up the run time of computation-heavy Python programs


I want to know how to speed up programs that involve a lot of computation on large datasets. I have 5 Python programs, and each of them performs some convoluted computations on a large dataset. For example, a portion of one of the programs is as follows:

    import pandas as pd

    df = get_data_from_redshift()

    # One aggregation function per column of interest.
    cols = [
        {'col': 'col1', 'func': pd.Series.nunique},
        {'col': 'col2', 'func': pd.Series.nunique},
        {'col': 'col3', 'func': lambda x: x.value_counts().to_dict()},
        {'col': 'col4', 'func': pd.Series.nunique},
        {'col': 'col5', 'func': pd.Series.nunique},
    ]

    # For every group, build a tuple holding the result of each function.
    d = df.groupby('column_name').apply(
        lambda x: tuple(c['func'](x[c['col']]) for c in cols)
    ).to_dict()

where get_data_from_redshift() connects to a Redshift cluster, gets the data from the database, and writes it to a dataframe (the dataframe is 600,000 rows x 6 columns).

The other programs also use this dataframe df and perform a lot of computation, and each program writes its result to a pickle file.

The final program loads the pickle files created by the 5 programs, does some computations on 300,000 values, checks them against a database in the cluster, and then writes the final output file.

Running each program individually takes hours (sometimes overnight). However, I need the whole thing to run within an hour and give me the final output file.

I tried putting one of the programs on an EC2 instance to see if performance improves, but it has been over 3 hours and it's still running. I tried m4.xlarge, c4.xlarge, and r4.xlarge instances, but none of them were useful.

Is there a way to speed up the total run time?

Maybe I could run each of the 5 programs on separate EC2 instances, but then each program gives an output file that the final program has to use. So, if I run them on multiple instances, the output files from each program will be saved on different servers, right? Then how will the final program use them? Can I save the output file from each program to a common location that the final program can access?

I've heard of GPUs being 14 times faster than CPUs, but I've never used them. Would using a GPU instance be of any help in this case?

Sorry, I'm new here and don't really know how to go about it.

You need to find out what's making it slow. You can use a profiler to start with if you cannot think of anything else at the moment. Finding out the exact problem is the simplest way of making it work better.
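As a rough sketch of that first step, Python's built-in cProfile and pstats modules can report where the time goes. The function slow_step and the helper profile_call below are hypothetical stand-ins for illustration, not your actual code:

```python
import cProfile
import io
import pstats


def slow_step(n):
    # Hypothetical stand-in for one expensive computation.
    return sum(i * i for i in range(n))


def profile_call(func, *args, top=10):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(slow_step, 100_000)
print(report)
```

To profile a whole script without changing it, python -m cProfile -s cumulative your_script.py gives a similar report on the command line.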

Beyond this, below is a generic approach.

First, optimizations in architecture/algorithms can substantially outperform any other optimization (such as those provided by programming languages, tools, or techniques like memoization). So first look thoroughly at whether your algorithm can be improved. This includes looking for parts that can run concurrently or be executed in parallel. For example, using map-reduce instead of linear data processing can cut execution time to a fraction! For that, though, the processing has to be divisible into mutually exclusive parts that can run in parallel.
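To make the map-reduce idea concrete, here is a minimal sketch using only the standard library. It assumes the work is a simple value count that can be split into independent chunks. The function names (split_into_chunks, count_values, merge_counts, parallel_value_counts) are made up for this example, and your real computation may not divide this cleanly:

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from functools import reduce


def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks non-overlapping, nearly equal slices."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_values(chunk):
    """Map step: count occurrences within one chunk."""
    return Counter(chunk)


def merge_counts(left, right):
    """Reduce step: combine two partial counts."""
    return left + right


def parallel_value_counts(items, workers=1):
    """Count values in items, optionally spreading chunks over processes."""
    chunks = split_into_chunks(items, max(workers, 1))
    if workers <= 1:
        # Serial fallback: same map and reduce steps, one process.
        partials = list(map(count_values, chunks))
    else:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            partials = list(pool.map(count_values, chunks))
    return reduce(merge_counts, partials, Counter())
```

When calling parallel_value_counts(items, workers=4) from a script, put the call under an if __name__ == "__main__": guard, since some platforms start worker processes by re-importing the main module. Starting processes and copying data to them has a cost, so it is worth measuring whether the parallel version actually wins for your data size.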

Next, look for unnecessary loops or computations. Using techniques like memoization can improve performance greatly.
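For instance, if the same input shows up many times, functools.lru_cache from the standard library can cache the results. This is only a sketch: expensive_lookup is a hypothetical pure function, and memoization only helps when the function always returns the same result for the same arguments:

```python
from functools import lru_cache

call_count = 0  # Tracks how often the function body actually runs.


@lru_cache(maxsize=None)
def expensive_lookup(key):
    """Hypothetical pure function: the same key always gives the same result."""
    global call_count
    call_count += 1
    return sum(ord(ch) for ch in key) ** 2


values = ["a", "b", "a", "a", "b"]
results = [expensive_lookup(v) for v in values]

# Five calls were made, but the body ran only once per distinct key.
print(call_count, expensive_lookup.cache_info())
```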

Also check whether any communication or I/O tasks (e.g. the communication with the Redshift cluster you've mentioned) are taking up time (this doesn't seem to apply to what you've shown, since your concern is the computation being slow).
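One simple way to check this is to time the I/O stage and the compute stage separately. The sketch below uses a small context manager built on time.perf_counter. The names load_data and compute are placeholders for your own steps, and the sleep only simulates a slow query:

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def load_data():
    # Hypothetical stand-in for an I/O-bound step such as a database query.
    time.sleep(0.05)
    return list(range(1000))


def compute(rows):
    # Hypothetical stand-in for the CPU-bound step.
    return sum(r * r for r in rows)


with timed("load"):
    rows = load_data()
with timed("compute"):
    total = compute(rows)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

If the load stage dominates, faster CPUs or more cores will not help much, and the effort is better spent on the query or the data transfer.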

Then there are minor optimizations: using functions like map and filter, or using generator expressions instead of lists, can optimize things (very) slightly.
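A small illustration of that last point, using made-up data: a generator expression yields values on demand instead of building the whole list in memory, and map/filter can be chained lazily in the same way:

```python
import sys

n = 100_000

squares_list = [i * i for i in range(n)]  # Materializes all n values at once.
squares_gen = (i * i for i in range(n))   # Produces values one at a time.

# The list object grows with n; the generator object stays small.
list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)

total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)

# map and filter are lazy too; nothing runs until list() consumes them.
evens = list(filter(lambda x: x % 2 == 0, map(int, ["1", "2", "3", "4"])))

print(list_size, gen_size, total_from_list == total_from_gen, evens)
```

The saving here is mostly memory; any speed difference is usually small next to algorithmic changes.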

