python - Run a loop off of a grouping of data in a pandas DataFrame
I have a large file of vendor records that I am looking for duplicates in. Duplicate values may appear in either the vendor name or the vendor address. I am using the dupandas package and need to break the data into smaller chunks so as not to overwhelm my computer's memory. Because possible duplicate records must have the same state, I am trying to split the vendor DataFrame by the state field and run dupandas on each state. I am having difficulty, however, figuring out how to write a loop to accomplish the following:
1. Take the rows for the first state (e.g. AK).
2. Run dupandas on them and write the result to a DataFrame.
3. Move to the next state (e.g. CA) and repeat step 2.
It seems like it should be straightforward, but for some reason it is totally stumping me.
Here is the non-loop form, which works on an artificially created subset of vendors in the state of WA:
import os
from os import rename, listdir
import pandas as pd
from pandas import read_fwf, DataFrame, Series
import csv
from dupandas import Dedupe

# set directory
os.chdir('c:/')

# find file and open
vendor = pd.read_csv('vendor.txt', sep='\t')

# fix OR coded with zeros
vendor.loc[vendor['province'] == '0r', 'province'] = 'or'
vendor['province'] = vendor['province'].fillna('na')

# create one field that concatenates the vendor address
vendor['addr'] = vendor['address1'] + ' ' + vendor['city'] + ' ' + vendor['province'] + ' ' + vendor['associate ein']
vendor['subset'] = vendor['province'] + ' ' + vendor['business activity code']

# test dupandas on vendors in the state of WA or CT
wa = vendor.loc[((vendor['province'] == 'wa') | (vendor['province'] == 'ct')) & (vendor['business activity code'].str.contains('vrt'))]

# initialize configurations for dupandas
clean_config = {'lower': True, 'punctuation': True, 'whitespace': True, 'digit': False}
match_config = {'exact': False, 'levenshtein': True, 'soundex': False, 'nysiis': False}
dupe = Dedupe(clean_config=clean_config, match_config=match_config)

input_config = {'input_data': wa, 'column': 'vndr_nm', '_id': 'vndr_cd'}
results = dupe.dedupe(input_config)
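For what it's worth, the three steps listed above map fairly directly onto iterating over a pandas groupby. Below is a minimal sketch of that loop, not a tested dupandas solution: run_per_group and flag_repeated_names are hypothetical names made up for illustration, the sample rows are invented, and flag_repeated_names is only a stand-in for the dupe.dedupe call so the sketch runs without the vendor file. Column names follow the snippet above.

```python
import pandas as pd


def run_per_group(df, group_col, func):
    """Apply func to each group's rows and stack the results into one DataFrame.

    Hypothetical helper: func takes one chunk (a DataFrame) and returns a DataFrame.
    """
    pieces = []
    # groupby yields (key, sub-DataFrame) pairs, one per distinct value of group_col.
    # Rows whose key is NaN are skipped by default, hence the fillna in the question.
    for key, chunk in df.groupby(group_col):
        result = func(chunk).copy()
        result[group_col] = key  # remember which state each result row came from
        pieces.append(result)
    return pd.concat(pieces, ignore_index=True) if pieces else pd.DataFrame()


def flag_repeated_names(chunk):
    """Stand-in for dupe.dedupe: flag names repeated (case-insensitively) in a chunk."""
    out = chunk[['vndr_cd', 'vndr_nm']].copy()
    out['is_dup'] = out['vndr_nm'].str.lower().duplicated(keep=False)
    return out


# Invented sample data, only so the sketch can run on its own.
vendor = pd.DataFrame({
    'vndr_cd': [1, 2, 3, 4, 5],
    'vndr_nm': ['Acme', 'acme', 'Bolt', 'Acme', 'Bolt'],
    'province': ['ak', 'ak', 'ak', 'ca', 'ca'],
})

results = run_per_group(vendor, 'province', flag_repeated_names)
print(results)
```

To use it with the setup above, the stand-in would presumably be swapped for something like lambda chunk: dupe.dedupe({'input_data': chunk, 'column': 'vndr_nm', '_id': 'vndr_cd'}), assuming dupe.dedupe returns a DataFrame as the non-loop form suggests. If the combined result is itself too large for memory, each state's result could be written to its own file inside the loop instead of being appended to the list.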