hadoop - Wordcount NoneType error in PySpark


I am trying to do some text analysis:

import re
import string
from pyspark.sql.functions import udf, explode, split
from pyspark.sql.types import StringType

def cleaning_text(sentence):
    sentence = sentence.lower()
    sentence = re.sub('\'', '', sentence.strip())
    # dates removed
    sentence = re.sub('^\d+\/\d+|\s\d+\/\d+|\d+\-\d+\-\d+|\d+\-\w+\-\d+\s\d+\:\d+|\d+\-\w+\-\d+|\d+\/\d+\/\d+\s\d+\:\d+', ' ', sentence.strip())
    sentence = re.sub(r'(.)(\/)(.)', r'\1\3', sentence.strip())
    sentence = re.sub("(.*?\//)|(.*?\\\\)|(.*?\\\)|(.*?\/)", ' ', sentence.strip())
    sentence = re.sub('^\d+', '', sentence.strip())
    sentence = re.sub('[%s]' % re.escape(string.punctuation), '', sentence.strip())
    # keep words of at least two characters that are not in the stop list
    cleaned = ' '.join([w for w in sentence.split() if not len(w) < 2 and w not in ('no', 'sc', 'ln')])
    cleaned = cleaned.strip()
    if len(cleaned) <= 1:
        return "na"
    else:
        return cleaned

org_val = udf(cleaning_text, StringType())
df_new = df.withColumn("cleaned_short_desc", org_val(df["symptom_short_description_"]))
df_new = df_new.withColumn("cleaned_long_desc", org_val(df_new["long_description"]))
longwordsdf = df_new.select(explode(split('cleaned_long_desc', ' ')).alias('word'))
longwordsdf.count()

I get the following error:

File "<stdin>", line 2, in cleaning_text
AttributeError: 'NoneType' object has no attribute 'lower'

I want to perform word counts, but any kind of aggregation function gives me this error.

I tried the following things:

sentence=sentence.encode("ascii", "ignore") 

I added the statement above to the cleaning_text function. I also tried:

df.dropna() 

It still gives the same error, and I do not know how to resolve it.

It looks like you have null values in some of the columns. Add an if at the beginning of the cleaning_text function and the error will disappear:

if sentence is None:
    return "na"
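As a rough illustration of that guard, here is a minimal sketch in plain Python that runs without Spark. The name cleaning_text_safe is hypothetical, and the body is a simplified stand-in for the question's function (only lowercasing, punctuation removal and the short-word filter are kept), so treat it as an assumption about the shape of the fix rather than the exact code to use. In Spark it would presumably be wrapped with udf(..., StringType()) just as cleaning_text is in the question.

```python
import re
import string

# Hypothetical, simplified stand-in for the question's cleaning_text.
def cleaning_text_safe(sentence):
    # A null column value reaches a Python UDF as None, so check for it
    # before calling any string method.
    if sentence is None:
        return "na"
    sentence = sentence.lower().strip()
    # Remove punctuation characters.
    sentence = re.sub('[%s]' % re.escape(string.punctuation), '', sentence)
    # Keep words of at least two characters that are not in the stop list.
    cleaned = ' '.join(
        w for w in sentence.split()
        if len(w) >= 2 and w not in ('no', 'sc', 'ln')
    )
    return cleaned if len(cleaned) > 1 else "na"

print(cleaning_text_safe(None))                       # na
print(cleaning_text_safe("Printer is NOT working!"))  # printer is not working
```

If dropping the null rows is preferred instead, note that df.dropna() returns a new DataFrame rather than modifying df in place, so its result would have to be assigned (for example to a hypothetical df_clean) before the UDF is applied.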
