apache spark - Is casting necessary for string columns which contain Int value for doing arithmetic comparison? -

January 15, 2010

i using apache spark 2.1.1.

i have dataset follows.

final case class testmodel(id: string,                        code: string,                         measure: string,                         value: string)

i loading csv file. different measure values, value datatype different. e.g. if measure 'age', value age in string. processing, casting value integertype , doing comparison age range specified in dataset. following correct way ?

val testdata = spark.read.schema(testschema).option("header", "false").csv(datapath).as[testmodel] val agebasedtestdata = testdata.filter($"measure" === "age")   var agebaseddata = agebasedtestdata.join(anotherds, agebasedtestdata("code") === anotherds("code") &&                                anotherds("ages").getitem(0) <= agebasedtestdata("value").cast(integertype) &&                                anotherds("ages").getitem(1) > agebasedtestdata("value").cast(integertype))                                .select( column names)

is above casting of value column interger type before comparison age range correct way of doing this? converting string int comparison purposes don't care datatype. have ran code both cast , without cast , both give me same results. not sure happening behind scenes without cast. automatically cast string int , comparison. if matters, datatype of "ages" array anotherds dataset integer.

so not sure happening behind scenes without cast.

take @ execution plan:

scala> spark.sql("select 1 < '42'").explain(true) == parsed logical plan == 'project [unresolvedalias((1 < 42), none)] +- onerowrelation$  == analyzed logical plan == (1 < cast(42 int)): boolean project [(1 < cast(42 int)) (1 < cast(42 int))#142] +- onerowrelation$  == optimized logical plan == project [true (1 < cast(42 int))#142] +- onerowrelation$  == physical plan == *project [true (1 < cast(42 int))#142] +- scan onerowrelation[]

and

scala> spark.sql("select '42' < 1").explain(true) == parsed logical plan == 'project [unresolvedalias((42 < 1), none)] +- onerowrelation$  == analyzed logical plan == (cast(42 int) < 1): boolean project [(cast(42 int) < 1) (cast(42 int) < 1)#147] +- onerowrelation$  == optimized logical plan == project [false (cast(42 int) < 1)#147] +- onerowrelation$  == physical plan == *project [false (cast(42 int) < 1)#147] +- scan onerowrelation[]

so if 1 argument numeric, second 1 casted.

it still recommended cast data desired types avoid mistakes like:

spark.sql("select '42' < '9'")

especially if consider, casting rules tricky , bit inconsistent:

scala> spark.sql("select to_date('2015-01-01') < '2012-01-01'").explain == physical plan == *project [false (cast(to_date('2015-01-01') string) < 2012-01-01)#4] +- scan onerowrelation[]

Search This Blog

Single

apache spark - Is casting necessary for string columns which contain Int value for doing arithmetic comparison? -

Comments

Post a Comment

Popular posts from this blog

neo4j - finding mutual friends in a cypher statement starting with three or more persons -

php - How to remove letter in front of the word laravel -

linux - Why does bash short curcuit fail in crontab? -