Python - multi-indexing a pandas DataFrame
I am wondering how to build multiple indexes for a DataFrame based on groups of list elements in a column.
Since it is easier to show than to explain, here is a script that displays what I have and what I want:
```python
import pandas as pd
from IPython.display import display
from utils import *

def ungroup_column(df, column, split_column=None):
    '''
    # Summary
        Takes a DataFrame column that contains lists and spreads the items
        in the lists over many rows. Similar to pandas.melt(), but acts on
        lists within a column.

    # Example
        Input DataFrame:
           farm_id            animals
        0        1  [pig, sheep, dog]
        1        2             [duck]
        2        3       [pig, horse]
        3        4     [sheep, horse]

        Output DataFrame:
           farm_id animals
        0        1     pig
        0        1   sheep
        0        1     dog
        1        2    duck
        2        3     pig
        2        3   horse
        3        4   sheep
        3        4   horse

    # Arguments
        df:           (pandas.DataFrame) the DataFrame to act upon
        column:       (string) the name of the column which contains the lists to separate
        split_column: (string) the column added to the DataFrame containing the split
                      items in the lists; if not given, the values are written over
                      the original column
    '''
    if split_column is None:
        split_column = column

    # split the column into multiple columns (one col for each item in the list) for every row;
    # transpose to make the lists go down the rows
    list_split_matrix = df[column].apply(pd.Series).T

    # the columns of `list_split_matrix` (they're integers) are the
    # indices of the rows in `df` - i.e. `df_row_idx`;
    # melt concats each column on top of the other
    melted_df = pd.melt(list_split_matrix, var_name='df_row_idx',
                        value_name=split_column).dropna().set_index('df_row_idx')

    if split_column == column:
        df = df.drop(column, axis=1)
    df = df.join(melted_df)

    return df

# exploratory context (train_df comes from my own data)
play_df = train_df
sent_idx = play_df.groupby('pmid')['sentence'].apply(lambda row: range(0, len(list(row))))
# set_index(['pmid', range(0, len())])
play_df.set_index('pmid')

# nlp is a loaded spaCy model
doc_texts = ['Here is a sentence. And another. Yet another sentence.',
             'Different document here. With some other sentences.']
playing_df = pd.DataFrame({'doc': [nlp(doc) for doc in doc_texts],
                           'sentences': [[s for s in nlp(doc).sents] for doc in doc_texts]})
display(playing_df)
display(ungroup_column(playing_df, 'sentences'))
```
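For reference, the core `apply(pd.Series).T` + `melt` trick the function relies on can be seen in isolation on the farm example from the docstring; this is a minimal, self-contained sketch of that step, not part of the original question:

```python
import pandas as pd

# Toy data taken from the docstring's farm example
df = pd.DataFrame({'farm_id': [1, 2, 3, 4],
                   'animals': [['pig', 'sheep', 'dog'], ['duck'],
                               ['pig', 'horse'], ['sheep', 'horse']]})

# one column per list item, transposed so lists run down the rows
list_split_matrix = df['animals'].apply(pd.Series).T

# melt stacks the columns; the melted variable names are df's row indices
melted = (pd.melt(list_split_matrix, var_name='df_row_idx', value_name='animals')
            .dropna()
            .set_index('df_row_idx'))

# join back on the row index to repeat farm_id once per animal
out = df.drop('animals', axis=1).join(melted)
print(out)
```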
The output of this is as follows:
```
                                                 doc                                          sentences
0  (here, is, a, sentence, ., and, another, ., ye...  [(here, is, a, sentence, .), (and, another, .)...
1  (different, document, here, ., with, some, oth...  [(different, document, here, .), (with, some, ...

                                                 doc                          sentences
0  (here, is, a, sentence, ., and, another, ., ye...         (here, is, a, sentence, .)
0  (here, is, a, sentence, ., and, another, ., ye...                  (and, another, .)
0  (here, is, a, sentence, ., and, another, ., ye...        (yet, another, sentence, .)
1  (different, document, here, ., with, some, oth...     (different, document, here, .)
1  (different, document, here, ., with, some, oth...  (with, some, other, sentences, .)
```
But I would like to have an index for the 'sentences' column as well, such as this:
```
doc_idx sent_idx                                           document                           sentence
0       0         (here, is, a, sentence, ., and, another, ., ye...         (here, is, a, sentence, .)
        1         (here, is, a, sentence, ., and, another, ., ye...                  (and, another, .)
        2         (here, is, a, sentence, ., and, another, ., ye...        (yet, another, sentence, .)
1       0         (different, document, here, ., with, some, oth...     (different, document, here, .)
        1         (different, document, here, ., with, some, oth...  (with, some, other, sentences, .)
```
Based on your second output, you can reset the index, then set the index based on the cumcount of the current index, and rename the axis, i.e.
```python
new_df = ungroup_column(playing_df, 'sentences').reset_index()
new_df['sent_idx'] = new_df.groupby('index').cumcount()
new_df.set_index(['index', 'sent_idx']).rename_axis(['doc_idx', 'sent_idx'])
```
Output:
```
                                                                doc                       sents
doc_idx sent_idx
0       0         [here, is, a, sentence, ., and, another, ., ye...         Here is a sentence.
        1         [here, is, a, sentence, ., and, another, ., ye...                And another.
        2         [here, is, a, sentence, ., and, another, ., ye...       Yet another sentence.
1       0         [different, document, here, ., with, some, oth...    Different document here.
        1         [different, document, here, ., with, some, oth...  With some other sentences.
```
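As a side note, newer pandas (0.25+) ships `DataFrame.explode`, which can stand in for the custom list-spreading function entirely; this is a sketch under that assumption, with plain-string toy data standing in for the spaCy docs:

```python
import pandas as pd

# Toy stand-in for the question's DataFrame of docs and sentence lists
playing_df = pd.DataFrame({
    'doc': ['doc one', 'doc two'],
    'sents': [['Here is a sentence.', 'And another.', 'Yet another sentence.'],
              ['Different document here.', 'With some other sentences.']]})

# explode spreads each list over rows, repeating the original index
exploded = playing_df.explode('sents').reset_index()
# cumcount numbers the sentences within each original row
exploded['sent_idx'] = exploded.groupby('index').cumcount()
result = exploded.set_index(['index', 'sent_idx']).rename_axis(['doc_idx', 'sent_idx'])
print(result)
```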
Instead of applying pd.Series, you can use np.concatenate to expand the column (I used NLTK to tokenize the words and sentences):
```python
import nltk
import numpy as np
import pandas as pd

doc_texts = ['Here is a sentence. And another. Yet another sentence.',
             'Different document here. With some other sentences.']
playing_df = pd.DataFrame({'doc': [nltk.word_tokenize(doc) for doc in doc_texts],
                           'sents': [nltk.sent_tokenize(doc) for doc in doc_texts]})

s = playing_df['sents']
i = np.arange(len(playing_df)).repeat(s.str.len())

new_df = playing_df.iloc[i, :-1].assign(**{'sents': np.concatenate(s.values)}).reset_index()
new_df['sent_idx'] = new_df.groupby('index').cumcount()
new_df.set_index(['index', 'sent_idx']).rename_axis(['doc_idx', 'sent_idx'])
```
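The repeat/concatenate trick above works because `np.arange(len(df)).repeat(s.str.len())` builds a positional index that repeats each row once per list item, while `np.concatenate` flattens all the lists into one array of the same length. A minimal sketch on toy lists:

```python
import numpy as np
import pandas as pd

s = pd.Series([['a', 'b', 'c'], ['d', 'e']])

i = np.arange(len(s)).repeat(s.str.len())  # row index repeated per list item
flat = np.concatenate(s.values)            # all items in one flat array

print(i)     # row 0 three times, row 1 twice
print(flat)  # the five items in order
```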
Hope this helps.