Python - multi-indexing a pandas DataFrame
I am wondering how to build multiple indexes for a DataFrame based on groups of list elements in a column.
Since it is easier to show than to explain, here is a script that displays what I have and what I want:
```python
import pandas as pd
from IPython.display import display
from utils import *

def ungroup_column(df, column, split_column=None):
    '''
    # Summary
        Takes a DataFrame column that contains lists and spreads the items
        in the lists over many rows. Similar to pandas.melt(), but acts on
        lists within a column.

    # Example
        Input DataFrame:
           farm_id            animals
        0        1  [pig, sheep, dog]
        1        2             [duck]
        2        3       [pig, horse]
        3        4     [sheep, horse]

        Output DataFrame:
           farm_id animals
        0        1     pig
        0        1   sheep
        0        1     dog
        1        2    duck
        2        3     pig
        2        3   horse
        3        4   sheep
        3        4   horse

    # Arguments
        df:           (pandas.DataFrame) the DataFrame to act upon
        column:       (string) the name of the column which contains the lists to separate
        split_column: (string) the column added to the DataFrame containing the split
                      items in the lists; if not given, the values are written over
                      the original column
    '''
    if split_column is None:
        split_column = column

    # split the column into multiple columns (one col for each item in the list) for every row;
    # transpose to make the lists go down the rows
    list_split_matrix = df[column].apply(pd.Series).T

    # the columns of `list_split_matrix` (they're integers) are the
    # indices of the rows in `df` - i.e. `df_row_idx`;
    # melt concats each column on top of the other
    melted_df = pd.melt(list_split_matrix, var_name='df_row_idx',
                        value_name=split_column).dropna().set_index('df_row_idx')

    if split_column == column:
        df = df.drop(column, axis=1)
    df = df.join(melted_df)

    return df

# exploratory context (train_df comes from my own data)
play_df = train_df
sent_idx = play_df.groupby('pmid')['sentence'].apply(lambda row: range(0, len(list(row))))
# set_index(['pmid', range(0, len())])
play_df.set_index('pmid')

# nlp is a loaded spaCy model
doc_texts = ['Here is a sentence. And another. Yet another sentence.',
             'Different document here. With some other sentences.']
playing_df = pd.DataFrame({'doc': [nlp(doc) for doc in doc_texts],
                           'sentences': [[s for s in nlp(doc).sents] for doc in doc_texts]})
display(playing_df)
display(ungroup_column(playing_df, 'sentences'))
```
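For reference, the core `apply(pd.Series).T` + `melt` trick the function relies on can be seen in isolation on the farm example from the docstring; this is a minimal, self-contained sketch of that step, not part of the original question:

```python
import pandas as pd

# Toy data taken from the docstring's farm example
df = pd.DataFrame({'farm_id': [1, 2, 3, 4],
                   'animals': [['pig', 'sheep', 'dog'], ['duck'],
                               ['pig', 'horse'], ['sheep', 'horse']]})

# one column per list item, transposed so lists run down the rows
list_split_matrix = df['animals'].apply(pd.Series).T

# melt stacks the columns; the melted variable names are df's row indices
melted = (pd.melt(list_split_matrix, var_name='df_row_idx', value_name='animals')
            .dropna()
            .set_index('df_row_idx'))

# join back on the row index to repeat farm_id once per animal
out = df.drop('animals', axis=1).join(melted)
print(out)
```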
The output of this is as follows:
```
                                                 doc                                          sentences
0  (here, is, a, sentence, ., and, another, ., ye...  [(here, is, a, sentence, .), (and, another, .)...
1  (different, document, here, ., with, some, oth...  [(different, document, here, .), (with, some, ...

                                                 doc                          sentences
0  (here, is, a, sentence, ., and, another, ., ye...         (here, is, a, sentence, .)
0  (here, is, a, sentence, ., and, another, ., ye...                  (and, another, .)
0  (here, is, a, sentence, ., and, another, ., ye...        (yet, another, sentence, .)
1  (different, document, here, ., with, some, oth...     (different, document, here, .)
1  (different, document, here, ., with, some, oth...  (with, some, other, sentences, .)
```
But I would like to have an index for the 'sentences' column as well, such as this:
```
doc_idx sent_idx                                           document                           sentence
0       0         (here, is, a, sentence, ., and, another, ., ye...         (here, is, a, sentence, .)
        1         (here, is, a, sentence, ., and, another, ., ye...                  (and, another, .)
        2         (here, is, a, sentence, ., and, another, ., ye...        (yet, another, sentence, .)
1       0         (different, document, here, ., with, some, oth...     (different, document, here, .)
        1         (different, document, here, ., with, some, oth...  (with, some, other, sentences, .)
```
Based on your second output, you can reset the index, then set the index based on the cumcount of the current index, and rename the axis, i.e.
```python
new_df = ungroup_column(playing_df, 'sentences').reset_index()
new_df['sent_idx'] = new_df.groupby('index').cumcount()
new_df.set_index(['index', 'sent_idx']).rename_axis(['doc_idx', 'sent_idx'])
```
Output:
```
                                                                doc                       sents
doc_idx sent_idx
0       0         [here, is, a, sentence, ., and, another, ., ye...         Here is a sentence.
        1         [here, is, a, sentence, ., and, another, ., ye...                And another.
        2         [here, is, a, sentence, ., and, another, ., ye...       Yet another sentence.
1       0         [different, document, here, ., with, some, oth...    Different document here.
        1         [different, document, here, ., with, some, oth...  With some other sentences.
```
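As a side note, newer pandas (0.25+) ships `DataFrame.explode`, which can stand in for the custom list-spreading function entirely; this is a sketch under that assumption, with plain-string toy data standing in for the spaCy docs:

```python
import pandas as pd

# Toy stand-in for the question's DataFrame of docs and sentence lists
playing_df = pd.DataFrame({
    'doc': ['doc one', 'doc two'],
    'sents': [['Here is a sentence.', 'And another.', 'Yet another sentence.'],
              ['Different document here.', 'With some other sentences.']]})

# explode spreads each list over rows, repeating the original index
exploded = playing_df.explode('sents').reset_index()
# cumcount numbers the sentences within each original row
exploded['sent_idx'] = exploded.groupby('index').cumcount()
result = exploded.set_index(['index', 'sent_idx']).rename_axis(['doc_idx', 'sent_idx'])
print(result)
```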
Instead of applying pd.Series, you can use np.concatenate to expand the column (I used NLTK to tokenize the words and sentences):
```python
import nltk
import numpy as np
import pandas as pd

doc_texts = ['Here is a sentence. And another. Yet another sentence.',
             'Different document here. With some other sentences.']
playing_df = pd.DataFrame({'doc': [nltk.word_tokenize(doc) for doc in doc_texts],
                           'sents': [nltk.sent_tokenize(doc) for doc in doc_texts]})

s = playing_df['sents']
i = np.arange(len(playing_df)).repeat(s.str.len())

new_df = playing_df.iloc[i, :-1].assign(**{'sents': np.concatenate(s.values)}).reset_index()
new_df['sent_idx'] = new_df.groupby('index').cumcount()
new_df.set_index(['index', 'sent_idx']).rename_axis(['doc_idx', 'sent_idx'])
```
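The repeat/concatenate trick above works because `np.arange(len(df)).repeat(s.str.len())` builds a positional index that repeats each row once per list item, while `np.concatenate` flattens all the lists into one array of the same length. A minimal sketch on toy lists:

```python
import numpy as np
import pandas as pd

s = pd.Series([['a', 'b', 'c'], ['d', 'e']])

i = np.arange(len(s)).repeat(s.str.len())  # row index repeated per list item
flat = np.concatenate(s.values)            # all items in one flat array

print(i)     # row 0 three times, row 1 twice
print(flat)  # the five items in order
```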
Hope this helps.