java - Finding Common Entities
I have a file with the following data, which contains related ids, such as:

    [id1, [id2, id3, id4,...,idn]]
    [id2, [id1, id4, id5,...,idn]]
    [id3, [id6, id9, id25,...,idn]]

with the first element being the key, and the subsequent list, of variable length, being the ids associated with that key. With real numbers:

    [1, [2, 23, 47,...,59]]
    [2, [1, 4, 5,...,67]]
    [3, [6, 9, 23,...,40]]

I'm looking for a way to determine how many key ids have n number of shared ids using Java Spark.
So, for example, let's say the definitive file looks like:

    [1, [2, 23, 47,109, 59, 12]]
    [2, [1, 4, 5, 47, 48, 49, 50, 51, 52,67]]
    [3, [6, 7, 9, 23, 48, 49, 50, 51, 52, 40]]
    [24, [6, 7, 9, 23, 48, 49, 50, 51, 52, 40, 199, 201, 222, 223]]
    [99, [1,2,3]]

and let's suppose I'm looking to see which key ids have at least 5 shared ids. In this case, 2, 3, and 24 have at least 5 shared ids (2 and 3 share 48 through 52, and 24 contains all of those as well). Or, if I specified n to be 10, then 3 and 24 have at least 10 shared ids, and so on and so forth.
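To make the expected result concrete, here is a minimal plain-Java sketch (no Spark) that reproduces the numbers in the example. It inverts the data so that each associated id points to the keys that list it, then counts every pair of keys found under the same id. The class and method names (`SharedIdCounter`, `sharedCounts`, `keysWithAtLeast`, `sampleData`) are made up for illustration. In Spark, the same two steps might correspond to exploding the list column, self-joining on the exploded value, and grouping by key pair with a count. That mapping is an assumption and is not tested here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SharedIdCounter {

    // For every pair of keys, counts how many associated ids they have in common.
    // The result maps "smallerKey,largerKey" to the shared-id count.
    static Map<String, Integer> sharedCounts(Map<Integer, List<Integer>> data) {
        // Step 1: invert the data, so each associated id maps to the keys listing it.
        Map<Integer, TreeSet<Integer>> keysById = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> e : data.entrySet()) {
            for (Integer id : e.getValue()) {
                keysById.computeIfAbsent(id, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        // Step 2: each pair of keys listed under the same id shares that id once.
        Map<String, Integer> counts = new TreeMap<>();
        for (TreeSet<Integer> keys : keysById.values()) {
            List<Integer> sorted = new ArrayList<>(keys);
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    counts.merge(sorted.get(i) + "," + sorted.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Keys that take part in at least one pair sharing n or more ids.
    static Set<Integer> keysWithAtLeast(Map<Integer, List<Integer>> data, int n) {
        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : sharedCounts(data).entrySet()) {
            if (e.getValue() >= n) {
                for (String key : e.getKey().split(",")) {
                    result.add(Integer.parseInt(key));
                }
            }
        }
        return result;
    }

    // The sample file from the question.
    static Map<Integer, List<Integer>> sampleData() {
        Map<Integer, List<Integer>> data = new HashMap<>();
        data.put(1, List.of(2, 23, 47, 109, 59, 12));
        data.put(2, List.of(1, 4, 5, 47, 48, 49, 50, 51, 52, 67));
        data.put(3, List.of(6, 7, 9, 23, 48, 49, 50, 51, 52, 40));
        data.put(24, List.of(6, 7, 9, 23, 48, 49, 50, 51, 52, 40, 199, 201, 222, 223));
        data.put(99, List.of(1, 2, 3));
        return data;
    }

    public static void main(String[] args) {
        System.out.println(keysWithAtLeast(sampleData(), 5));  // prints [2, 3, 24]
        System.out.println(keysWithAtLeast(sampleData(), 10)); // prints [3, 24]
    }
}
```

The pair enumeration in step 2 grows quadratically with the number of keys that list the same id, so a very common id could dominate the cost in any distributed version of this idea.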
Any thoughts? I have a way to do this, but it requires calls to collect(), which I'd like to avoid. I have experience with PySpark, but Java Spark is new to me. Any help is appreciated, as always!
Edit: adding code per request. The following is the solution with collect() being called:
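    import static org.apache.spark.sql.functions.col;

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.expressions.Window;
    import org.apache.spark.sql.functions;

    SparkSession spark = SparkSession.builder().appName("testname").master("local[*]").getOrCreate();
    JavaSparkContext sparkContext = new JavaSparkContext(spark.sparkContext());

    // Read the file, splitting each line into key and values on ":"
    Dataset<Row> ids = spark.read().option("sep", ":").csv("<filename>").toDF();
    ids = ids.withColumn("id", functions.row_number().over(Window.orderBy("_c0")));
    ids = ids.withColumnRenamed("_c0", "key").withColumnRenamed("_c1", "values");
    ids.show();

    for (int i = 1; i < ids.count(); i++) {
        Dataset<Row> values = ids.select("values").where(col("key").equalTo(i));
        JavaRDD<Object> bodyRDD = values.toJavaRDD().map(x -> x.get(0));
        List<Object> a = bodyRDD.collect();  // pulls the row back to the driver
        ArrayList<String> idValues = parseString((String) a.get(0));
        int counter = 0;
        for (int j = i + 1; j < ids.count(); j++) {
            Dataset<Row> values2 = ids.select("values").where(col("key").equalTo(j));
            JavaRDD<Object> bodyRDD2 = values2.toJavaRDD().map(x -> x.get(0));
            List<Object> b = bodyRDD2.collect();  // second collect() per pair
            ArrayList<String> idValues2 = parseString((String) b.get(0));
            // compare idValues to idValues2, count the results, and store them in a map
        }
    }

However, this solution requires calls to collect(), which I'd like to avoid.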
Also, parseString() is a helper method that parses the contents of the file into a list of values (right now the list reads in as a string, and I'm not sure how to fix the schema yet).
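The helper itself is not shown, so the following is only a hypothetical stand-in to illustrate what it might do. It assumes the values column arrives as a single string such as `[2, 23, 47,109, 59, 12]`, and it returns the ids as strings to match the `ArrayList<String>` type used in the code above. The class name `IdListParser` is invented for this sketch.

```java
import java.util.ArrayList;

class IdListParser {

    // Hypothetical version of parseString(): turns "[2, 23, 47]" into ["2", "23", "47"].
    static ArrayList<String> parseString(String raw) {
        ArrayList<String> values = new ArrayList<>();
        // Remove the square brackets and all whitespace, leaving "2,23,47".
        String cleaned = raw.replaceAll("[\\[\\]\\s]", "");
        for (String token : cleaned.split(",")) {
            if (!token.isEmpty()) {  // skips the empty token produced by "[]"
                values.add(token);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parseString("[2, 23, 47,109, 59, 12]")); // prints [2, 23, 47, 109, 59, 12]
    }
}
```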