apache spark - GC overhead limit exceeded on PySpark


I'm working on huge logs with PySpark, and I'm facing memory issues on the cluster.

It gives me the following error:

HTTP ERROR 500

Problem accessing /jobs/. Reason:

Server Error caused by:

java.lang.OutOfMemoryError: GC overhead limit exceeded

Here's my current configuration:

spark.driver.cores                     3
spark.driver.memory                    6g
spark.executor.cores                   3
spark.executor.instances               20
spark.executor.memory                  6g
spark.yarn.executor.memoryOverhead     2g
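
For context, here is a minimal sketch of how the same settings could be passed through the SparkSession builder in PySpark (the app name is hypothetical, and on YARN the driver settings such as spark.driver.memory generally have to be set at submit time or in spark-defaults.conf rather than in application code):

from pyspark.sql import SparkSession

# Sketch only: mirrors the configuration listed above.
spark = (
    SparkSession.builder
    .appName("log-processing")  # hypothetical app name
    .config("spark.driver.cores", "3")
    .config("spark.driver.memory", "6g")
    .config("spark.executor.cores", "3")
    .config("spark.executor.instances", "20")
    .config("spark.executor.memory", "6g")
    .config("spark.yarn.executor.memoryOverhead", "2g")
    .getOrCreate()
)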

First of all, I don't cache/persist anything in my Spark job.

I've read about memoryOverhead, and that's why I've increased it, but it seems that's not enough. I've also read that the issue could be the garbage collector. And that's my main question here: what's the best practice when you have to deal with many different databases?
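
For reference, one commonly suggested experiment for GC pressure (not something from my setup above, just a sketch assuming Java 8 style JVM flags) is switching the executors to G1GC and turning on GC logging so the executor logs show what the collector is doing:

from pyspark.sql import SparkSession

# Sketch only: G1GC plus GC logging on the executors (Java 8 flags),
# so the "GC overhead limit exceeded" behaviour can be inspected in the logs.
spark = (
    SparkSession.builder
    .config("spark.executor.extraJavaOptions",
            "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails")
    .getOrCreate()
)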

I have a lot of joins, which I'm doing with Spark SQL, and I'm creating a lot of temp views. Is that a bad practice? Would it be better to make huge SQL requests with, say, 10 joins inside one SQL request? That would reduce code readability, but would it solve the problem? A sketch of the pattern I'm using is shown below.
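
This is roughly what my code looks like (table names and paths are made up for illustration): I register temp views and express each join in Spark SQL.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs, just to illustrate the temp-view + join pattern.
logs = spark.read.parquet("/data/logs")
users = spark.read.parquet("/data/users")

logs.createOrReplaceTempView("logs")
users.createOrReplaceTempView("users")

# Temp views are lazy metadata only; registering many of them does not by
# itself materialize data. The cost comes from the physical join strategy
# Spark picks (broadcast vs. sort-merge).
joined = spark.sql("""
    SELECT l.*, u.country
    FROM logs l
    JOIN users u ON l.user_id = u.id
""")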

Thanks,

Well, I think I've fixed the issue. It was broadcasting.

I think that because my joins are pretty big and need quite some time, I've disabled broadcasting:

config("spark.sql.autobroadcastjointhreshold", "-1") 

The problem seems to be solved.

Thanks,

