Apache Spark - GC overhead limit exceeded on PySpark
I'm working on huge logs with PySpark and I'm facing memory issues on the cluster.
It gives me the following error:
HTTP ERROR 500
Problem accessing /jobs/. Reason:
Server Error
Caused by:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Here's my current configuration:
spark.driver.cores                   3
spark.driver.memory                  6g
spark.executor.cores                 3
spark.executor.instances             20
spark.executor.memory                6g
spark.yarn.executor.memoryOverhead   2g
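For reference, a minimal sketch of how the same values could be set when building the SparkSession; the app name is a placeholder, and setting them in code (rather than in spark-defaults.conf or on the spark-submit command line, which is what the post likely uses) is an assumption:

from pyspark.sql import SparkSession

# Sketch only: mirrors the configuration listed above.
# Note that driver settings (spark.driver.memory, spark.driver.cores) usually
# have to be fixed before the driver JVM starts, so in client mode they belong
# in spark-defaults.conf or on spark-submit rather than here.
spark = (
    SparkSession.builder
    .appName("log-processing")  # hypothetical app name
    .config("spark.driver.cores", "3")
    .config("spark.driver.memory", "6g")
    .config("spark.executor.cores", "3")
    .config("spark.executor.instances", "20")
    .config("spark.executor.memory", "6g")
    .config("spark.yarn.executor.memoryOverhead", "2g")
    .getOrCreate()
)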
First of all, I don't cache/persist anything in the Spark job.
I've read about memoryOverhead, which is why I've increased it, but it seems that's not enough. I've also read that this can be a garbage collector issue. And that's my main question here: what's the best practice when you have to deal with many different databases?
I have a lot of joins, I'm doing them with Spark SQL, and I'm creating a lot of temp views. Is that bad practice? Would it be better to make one huge SQL request with 10 joins inside a single statement? That would reduce code readability, but would it solve the problem?
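For illustration, a sketch of the two styles being compared; the table names, column names and data are invented:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-styles-sketch").getOrCreate()

# Tiny stand-in DataFrames; everything here is hypothetical.
logs = spark.createDataFrame([(1, 100, 7)], ["user_id", "bytes", "session_id"])
users = spark.createDataFrame([(1, "FR")], ["id", "country"])
sessions = spark.createDataFrame([(7, 300)], ["id", "duration"])

logs.createOrReplaceTempView("logs")
users.createOrReplaceTempView("users")
sessions.createOrReplaceTempView("sessions")

# Style 1: one join at a time, each step registered as another temp view.
step1 = spark.sql(
    "SELECT l.*, u.country FROM logs l JOIN users u ON l.user_id = u.id")
step1.createOrReplaceTempView("logs_users")
result_stepwise = spark.sql(
    "SELECT lu.*, s.duration FROM logs_users lu JOIN sessions s ON lu.session_id = s.id")

# Style 2: the same joins written as a single SQL statement.
result_single = spark.sql("""
    SELECT l.*, u.country, s.duration
    FROM logs l
    JOIN users u    ON l.user_id = u.id
    JOIN sessions s ON l.session_id = s.id
""")

Temp views are not materialized; both forms end up as logical plans that Catalyst optimizes as a whole, so splitting the query into views is mostly a readability choice rather than something that changes the physical plan.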
Thanks,
Well, I think I've fixed the issue: it was broadcasting.
I think my joins are pretty big and take quite some time, so I've disabled broadcasting:
config("spark.sql.autobroadcastjointhreshold", "-1")
The problem seems to be solved.
Thanks,