python - Not able to access class methods from pyspark RDD's map method -


while integrating pyspark in application's code-base, couldn't refer class's method in rdd's map method. duplicated issue simple example follows

here's dummy class that, have defined adds number every element of rdd derived rdd class attribute:

class test:      def __init__(self):         self.sc = sparkcontext()         = [('a', 1), ('b', 2), ('c', 3)]         self.a_r = self.sc.parallelize(a)      def add(self, a, b):         return + b      def test_func(self, b):         c_r = self.a_r.map(lambda l: (l[0], l[1] * 2))         v = c_r.map(lambda l: self.add(l[1], b))         v_c = v.collect()         return v_c 

test_func() calls map() method on rdd v, in-turn calls add() method on every element of v. calling test_func() throws following error:

pickle.picklingerror: not serialize object: exception: appears attempting reference sparkcontext broadcast variable, action, or transformation. sparkcontext can used on driver, not in code run on workers. more information, see spark-5063. 

now, when move add() method out of class like:

def add(self, a, b):     return + b  class test:      def __init__(self):         self.sc = sparkcontext()         = [('a', 1), ('b', 2), ('c', 3)]         self.a_r = self.sc.parallelize(a)      def test_func(self, b):          c_r = self.a_r.map(lambda l: (l[0], l[1] * 2))         v = c_r.map(lambda l: add(l[1], b))         v_c = v.collect()          return v_c 

calling test_func() works now.

[7, 9, 11] 

why happen , how can pass class methods rdd's map() method?


Comments

Popular posts from this blog

neo4j - finding mutual friends in a cypher statement starting with three or more persons -

php - How to remove letter in front of the word laravel -

minify - Minimizing css files -