python - Not able to access class methods from pyspark RDD's map method -
while integrating pyspark in application's code-base, couldn't refer class's method in rdd's map method. duplicated issue simple example follows
here's dummy class that, have defined adds number every element of rdd derived rdd class attribute:
class test: def __init__(self): self.sc = sparkcontext() = [('a', 1), ('b', 2), ('c', 3)] self.a_r = self.sc.parallelize(a) def add(self, a, b): return + b def test_func(self, b): c_r = self.a_r.map(lambda l: (l[0], l[1] * 2)) v = c_r.map(lambda l: self.add(l[1], b)) v_c = v.collect() return v_c test_func() calls map() method on rdd v, in-turn calls add() method on every element of v. calling test_func() throws following error:
pickle.picklingerror: not serialize object: exception: appears attempting reference sparkcontext broadcast variable, action, or transformation. sparkcontext can used on driver, not in code run on workers. more information, see spark-5063. now, when move add() method out of class like:
def add(self, a, b): return + b class test: def __init__(self): self.sc = sparkcontext() = [('a', 1), ('b', 2), ('c', 3)] self.a_r = self.sc.parallelize(a) def test_func(self, b): c_r = self.a_r.map(lambda l: (l[0], l[1] * 2)) v = c_r.map(lambda l: add(l[1], b)) v_c = v.collect() return v_c calling test_func() works now.
[7, 9, 11] why happen , how can pass class methods rdd's map() method?
Comments
Post a Comment