hadoop - Word count NoneType error in PySpark
I am trying to do some text analysis:
    import re
    import string

    from pyspark.sql.functions import explode, split, udf
    from pyspark.sql.types import StringType

    def cleaning_text(sentence):
        sentence = sentence.lower()
        sentence = re.sub('\'', '', sentence.strip())
        # dates removed
        sentence = re.sub('^\d+\/\d+|\s\d+\/\d+|\d+\-\d+\-\d+|\d+\-\w+\-\d+\s\d+\:\d+|\d+\-\w+\-\d+|\d+\/\d+\/\d+\s\d+\:\d+', ' ', sentence.strip())
        sentence = re.sub(r'(.)(\/)(.)', r'\1\3', sentence.strip())
        sentence = re.sub("(.*?\//)|(.*?\\\\)|(.*?\\\)|(.*?\/)", ' ', sentence.strip())
        sentence = re.sub('^\d+', '', sentence.strip())
        sentence = re.sub('[%s]' % re.escape(string.punctuation), '', sentence.strip())
        # keep words of two or more characters that are not in the skip list
        cleaned = ' '.join([w for w in sentence.split() if not len(w) < 2 and w not in ('no', 'sc', 'ln')])
        cleaned = cleaned.strip()
        if len(cleaned) <= 1:
            return "na"
        else:
            return cleaned

    org_val = udf(cleaning_text, StringType())
    df_new = df.withColumn("cleaned_short_desc", org_val(df["symptom_short_description_"]))
    df_new = df_new.withColumn("cleaned_long_desc", org_val(df_new["long_description"]))
    longwordsdf = (df_new.select(explode(split('cleaned_long_desc', ' ')).alias('word')))
    longwordsdf.count()

I get the following error:
    File "<stdin>", line 2, in cleaning_text
    AttributeError: 'NoneType' object has no attribute 'lower'
I want to perform word counts, but any kind of aggregation function gives me this error.
I tried the following things:
I added the statement sentence = sentence.encode("ascii", "ignore") to the cleaning_text function.
I called df.dropna().
It still gives the same issue, and I do not know how to resolve it.
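It looks like you have null values in your columns. Spark passes a null to a Python UDF as None, so sentence.lower() fails on those rows. Add an if at the beginning of the cleaning_text function and the error will disappear:

    if sentence is None:
        return "na"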
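To see the guard in context, here is a minimal sketch in plain Python that can be tried without a Spark session. It is a simplified stand-in for the function in the question, not the original: the date and path regexes are left out, and the name SKIP_WORDS is introduced here only for illustration.

```python
import re
import string

# Skip list copied from the filter in the question; the constant name is hypothetical.
SKIP_WORDS = ("no", "sc", "ln")


def cleaning_text(sentence):
    # A null column value arrives in a Python UDF as None, so check before
    # calling any string method on it.
    if sentence is None:
        return "na"
    sentence = sentence.lower().strip()
    # Remove punctuation (the date and path regexes from the question are omitted).
    sentence = re.sub("[%s]" % re.escape(string.punctuation), "", sentence)
    # Keep words of two or more characters that are not in the skip list.
    words = [w for w in sentence.split() if len(w) >= 2 and w not in SKIP_WORDS]
    cleaned = " ".join(words).strip()
    return cleaned if len(cleaned) > 1 else "na"


print(cleaning_text(None))
print(cleaning_text("Printer, NO power!"))
```

The function can then be wrapped with udf(cleaning_text, StringType()) as in the question. Also note that df.dropna() returns a new DataFrame rather than changing df in place, so calling it without assigning the result has no effect; that may be why it did not help here.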