apache spark - Is casting necessary for string columns which contain Int value for doing arithmetic comparison? -
i using apache spark 2.1.1.
i have dataset follows.
final case class testmodel(id: string, code: string, measure: string, value: string) i loading csv file. different measure values, value datatype different. e.g. if measure 'age', value age in string. processing, casting value integertype , doing comparison age range specified in dataset. following correct way ?
val testdata = spark.read.schema(testschema).option("header", "false").csv(datapath).as[testmodel] val agebasedtestdata = testdata.filter($"measure" === "age") var agebaseddata = agebasedtestdata.join(anotherds, agebasedtestdata("code") === anotherds("code") && anotherds("ages").getitem(0) <= agebasedtestdata("value").cast(integertype) && anotherds("ages").getitem(1) > agebasedtestdata("value").cast(integertype)) .select( column names) is above casting of value column interger type before comparison age range correct way of doing this? converting string int comparison purposes don't care datatype. have ran code both cast , without cast , both give me same results. not sure happening behind scenes without cast. automatically cast string int , comparison. if matters, datatype of "ages" array anotherds dataset integer.
so not sure happening behind scenes without cast.
take @ execution plan:
scala> spark.sql("select 1 < '42'").explain(true) == parsed logical plan == 'project [unresolvedalias((1 < 42), none)] +- onerowrelation$ == analyzed logical plan == (1 < cast(42 int)): boolean project [(1 < cast(42 int)) (1 < cast(42 int))#142] +- onerowrelation$ == optimized logical plan == project [true (1 < cast(42 int))#142] +- onerowrelation$ == physical plan == *project [true (1 < cast(42 int))#142] +- scan onerowrelation[] and
scala> spark.sql("select '42' < 1").explain(true) == parsed logical plan == 'project [unresolvedalias((42 < 1), none)] +- onerowrelation$ == analyzed logical plan == (cast(42 int) < 1): boolean project [(cast(42 int) < 1) (cast(42 int) < 1)#147] +- onerowrelation$ == optimized logical plan == project [false (cast(42 int) < 1)#147] +- onerowrelation$ == physical plan == *project [false (cast(42 int) < 1)#147] +- scan onerowrelation[] so if 1 argument numeric, second 1 casted.
it still recommended cast data desired types avoid mistakes like:
spark.sql("select '42' < '9'") especially if consider, casting rules tricky , bit inconsistent:
scala> spark.sql("select to_date('2015-01-01') < '2012-01-01'").explain == physical plan == *project [false (cast(to_date('2015-01-01') string) < 2012-01-01)#4] +- scan onerowrelation[]
Comments
Post a Comment