java - Finding Common Entities

March 15, 2015

I have a file containing related IDs, in the following format:

    [id1, [id2, id3, id4,...,idn]]
    [id2, [id1, id4, id5,...,idn]]
    [id3, [id6, id9, id25,...,idn]]

The first element is the key, and the list that follows holds the IDs associated with that key; the list is of variable length. With real numbers:

    [1, [2, 23, 47,...,59]]
    [2, [1, 4, 5,...,67]]
    [3, [6, 9, 23,...,40]]

I'm looking for a way to determine how many key IDs have n shared IDs, using Java Spark.

So, for example, let's say the complete file looks like this:

    [1, [2, 23, 47, 109, 59, 12]]
    [2, [1, 4, 5, 47, 48, 49, 50, 51, 52, 67]]
    [3, [6, 7, 9, 23, 48, 49, 50, 51, 52, 40]]
    [24, [6, 7, 9, 23, 48, 49, 50, 51, 52, 40, 199, 201, 222, 223]]
    [99, [1, 2, 3]]

And let's suppose I'm looking to see which key IDs have at least 5 shared IDs. In this case, keys 2, 3, and 24 have at least 5 shared IDs (each shares 5 or more with at least one other key). Or, if I specified n as 10, then keys 3 and 24 have at least 10 shared IDs, and so on.
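To make the expected result concrete, here is a minimal plain-Java sketch (no Spark) that checks the example above. It assumes "shared" means the size of the intersection of two keys' lists, and it builds an inverted index from each ID to the keys that list it, so only keys that actually share an ID are ever paired. The class and method names (`SharedIds`, `sharedCounts`, `keysWithAtLeast`, `sample`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class SharedIds {

    // For every unordered pair of keys that share at least one id,
    // count how many ids appear in both of their lists.
    static Map<List<Integer>, Integer> sharedCounts(Map<Integer, List<Integer>> data) {
        // Inverted index: id -> sorted set of keys whose list contains that id.
        Map<Integer, TreeSet<Integer>> keysById = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> entry : data.entrySet()) {
            for (Integer id : entry.getValue()) {
                keysById.computeIfAbsent(id, k -> new TreeSet<>()).add(entry.getKey());
            }
        }

        // Each id contributes 1 to every pair of keys that both list it.
        Map<List<Integer>, Integer> counts = new HashMap<>();
        for (TreeSet<Integer> keys : keysById.values()) {
            List<Integer> sorted = new ArrayList<>(keys);
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    counts.merge(List.of(sorted.get(i), sorted.get(j)), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Keys that share at least n ids with some other key.
    static SortedSet<Integer> keysWithAtLeast(Map<Integer, List<Integer>> data, int n) {
        SortedSet<Integer> result = new TreeSet<>();
        for (Map.Entry<List<Integer>, Integer> entry : sharedCounts(data).entrySet()) {
            if (entry.getValue() >= n) {
                result.addAll(entry.getKey());
            }
        }
        return result;
    }

    // The example file from the question.
    static Map<Integer, List<Integer>> sample() {
        return Map.of(
            1, List.of(2, 23, 47, 109, 59, 12),
            2, List.of(1, 4, 5, 47, 48, 49, 50, 51, 52, 67),
            3, List.of(6, 7, 9, 23, 48, 49, 50, 51, 52, 40),
            24, List.of(6, 7, 9, 23, 48, 49, 50, 51, 52, 40, 199, 201, 222, 223),
            99, List.of(1, 2, 3));
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> data = sample();
        System.out.println(keysWithAtLeast(data, 5));  // prints [2, 3, 24]
        System.out.println(keysWithAtLeast(data, 10)); // prints [3, 24]
    }
}
```

The same shape (one row per key/ID pair, pair up rows on the ID, count per key pair, filter on n) is what a distributed version would need to reproduce; this sketch only pins down the expected output on the sample data.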

Any thoughts? I have a way to do this, but it requires calls to collect(), which I'd like to avoid. I have experience with PySpark, but Java Spark is new to me. Any help is appreciated, as always!

Edit: adding code per request. The following is the solution in which collect() is being called:

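    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.expressions.Window;
    import org.apache.spark.sql.functions;
    import static org.apache.spark.sql.functions.col;

    SparkSession spark =
        SparkSession.builder().appName("testname").master("local[*]").getOrCreate();
    JavaSparkContext sparkContext = new JavaSparkContext(spark.sparkContext());

    // Read "key:values" lines; both columns come back as strings.
    Dataset<Row> ids = spark.read().option("sep", ":")
                        .csv("<filename>").toDF();
    ids = ids.withColumn("id", functions
            .row_number().over(Window.orderBy("_c0")));
    ids = ids.withColumnRenamed("_c0", "key").withColumnRenamed("_c1", "values");
    ids.show();

    for (int i = 1; i < ids.count(); i++) {
        Dataset<Row> values = ids.select("values").where(col("key").equalTo(i));
        JavaRDD<Object> bodyRdd = values.toJavaRDD().map(x -> x.get(0));
        List<Object> a = bodyRdd.collect();   // first collect() per outer iteration
        ArrayList<String> idValues = parseString((String) a.get(0));
        int counter = 0;
        for (int j = i + 1; j < ids.count(); j++) {
            Dataset<Row> values2 =
                        ids.select("values").where(col("key").equalTo(j));
            JavaRDD<Object> bodyRdd2 = values2.toJavaRDD().map(x -> x.get(0));
            List<Object> b = bodyRdd2.collect();   // second collect() per inner iteration
            ArrayList<String> idValues2 = parseString((String) b.get(0));

            // compare idValues to idValues2, count the results, and store in a map

        }
    }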

However, this solution requires calls to collect(), which I'd like to avoid.

Also, parseString() is a helper method that parses the contents of the file's list of values into a list (right now it reads in as a String, and I'm not sure how to fix the schema yet).
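The helper itself isn't shown, so the following is only a hypothetical stand-in. It assumes the "values" column arrives as a bracketed, comma-separated string such as `[2, 23, 47,109, 59, 12]`, and returns the IDs as trimmed strings to match the `ArrayList<String>` used in the loop above. The class name `ParseIds` is invented for the example.

```java
import java.util.ArrayList;

class ParseIds {

    // Hypothetical version of parseString(): "[2, 23, 47,109]" -> ["2", "23", "47", "109"].
    static ArrayList<String> parseString(String raw) {
        ArrayList<String> ids = new ArrayList<>();
        // Drop the surrounding brackets, then split on commas.
        String body = raw.replace("[", "").replace("]", "").trim();
        if (body.isEmpty()) {
            return ids; // an empty list such as "[]" yields no ids
        }
        for (String token : body.split(",")) {
            String id = token.trim();
            if (!id.isEmpty()) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(parseString("[2, 23, 47,109, 59, 12]")); // prints [2, 23, 47, 109, 59, 12]
    }
}
```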
