Preserving the number of partitions of a Spark DataFrame after transformation

January 15, 2014

I am looking at a bug in code where a DataFrame has been split into many more partitions than desired (over 700), and this causes too many shuffle operations when I try to repartition it to 48. I can't use coalesce() here because I want to have fewer partitions in the first place, before I repartition.

I am looking at ways to reduce the number of partitions. Let's say I have a Spark DataFrame (with multiple columns) divided into 10 partitions. I need to do an orderBy transformation based on one of the columns. After this operation is done, will the resulting DataFrame have the same number of partitions? If not, how does Spark decide on the number of partitions?

Also, what other transformations can change the number of partitions of a DataFrame that I need to be aware of, other than the obvious ones like repartition()?

The number of partitions for operations that require an exchange (a shuffle) is defined by spark.sql.shuffle.partitions. If you want a particular value, set it before executing the command:

scala> val df = spark.range(0, 1000)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> spark.conf.set("spark.sql.shuffle.partitions", 1)

scala> df.orderBy("id").rdd.getNumPartitions
res1: Int = 1

scala> spark.conf.set("spark.sql.shuffle.partitions", 42)

scala> df.orderBy("id").rdd.getNumPartitions
res3: Int = 42
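To address the second part of the question, here is a minimal sketch contrasting transformations that keep the partition count with ones that change it. It is an illustration rather than a definitive answer. The assumptions are mine: a local session created with `SparkSession.builder()` (in spark-shell, `spark` already exists), adaptive query execution switched off so it does not merge post-shuffle partitions, and the illustrative names `narrow`, `sorted` and `merged`.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session; skip this in spark-shell, where `spark` is predefined.
val spark = SparkSession.builder()
  .master("local[4]")
  .appName("partition-count-sketch")
  .getOrCreate()

// Assumption: turn off adaptive query execution so post-shuffle partitions are not coalesced.
spark.conf.set("spark.sql.adaptive.enabled", "false")
// Target partition count for operations that need an exchange (orderBy, groupBy, join, ...).
spark.conf.set("spark.sql.shuffle.partitions", "48")

// Start from a DataFrame with exactly 10 partitions, as in the question.
val df = spark.range(0, 1000).repartition(10).toDF("id")

// Narrow transformations (filter, select) do not shuffle, so the count stays at 10.
val narrow = df.filter("id % 2 = 0").select("id")

// orderBy needs an exchange, so its output is governed by spark.sql.shuffle.partitions,
// not by the 10 input partitions.
val sorted = df.orderBy("id")

// coalesce merges existing partitions without a full shuffle; it can only lower the count.
val merged = sorted.coalesce(8)

println(s"narrow: ${narrow.rdd.getNumPartitions}")
println(s"sorted: ${sorted.rdd.getNumPartitions}")
println(s"merged: ${merged.rdd.getNumPartitions}")
```

In practice this suggests setting spark.sql.shuffle.partitions to 48 before the orderBy, so the sort produces the desired count directly and the later repartition becomes unnecessary. With adaptive query execution enabled (the default in recent Spark versions), the post-shuffle count may come out lower than the configured value.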
