python - Sentence matching with gensim word2vec: manually populated model doesn't work


I'm trying to solve a sentence-comparison problem using the naive approach of summing word vectors and comparing the results. My goal is to match people by interest, so the dataset consists of names and short sentences describing their hobbies. The batches are fairly small, a few hundred people, so I wanted to give this a try before digging into doc2vec.

I prepare the data by cleaning it completely: removing stop words, tokenizing and lemmatizing. I use a pre-trained model for the word vectors, and it returns adequate results when finding similarities for some test words. I also tried summing the sentence words to find similarities in the original model, and the matches do make sense. The similarities are around the general sense of the phrase.

For sentence matching I'm trying the following: create an empty model

    import gensim as gs  # gs is assumed to be an alias for the gensim package

    b = gs.models.Word2Vec(min_count=1, size=300, sample=0, hs=0)

Build the vocab out of names (or person IDs), with no training

    # first create a vocab with an empty vector
    test = [['test']]
    b.build_vocab(test)
    b.wv.syn0[b.wv.vocab['test'].index] = b.wv.syn0[b.wv.vocab['test'].index]*0

    # populate the vocab from an array of person IDs
    b.build_vocab([personids], update=True)

Sum each sentence's word vectors and store the results in the model under the corresponding ID

    # sentences are pulled from a pandas dataset df.
    # 'a' is the pre-trained model I use to get vectors for each word

    def summ(phrase, start_model):
        '''
        vector addition function
        '''
        # starting from a vector of 0's
        sum_vec = start_model.word_vec("cat_noun")*0
        for word in phrase:
            sum_vec += start_model.word_vec(word)
        return sum_vec

    for i, row in df.iterrows():
        try:
            personid = row["id"]
            summvec = summ(df.iloc[i,1],a)
            # updating syn0 for each name/id in the vocabulary
            b.wv.syn0[b.wv.vocab[personid].index] = summvec
        except:
            pass

I understand that I shouldn't be expecting much accuracy here, but the t-SNE plot doesn't show any clustering whatsoever. The find-similarities method also fails to find matches (<0.2 similarity coefficient for basically everything).

[Figure: plot of entire model]

I'm wondering if anyone has an idea of where I went wrong. Is my approach valid at all?

Your code, as shown, neither does any train() of word-vectors (using your local text), nor does it pre-load any vectors from elsewhere. So any vectors that do exist, created by the build_vocab() calls, will still be in their randomly-initialized starting locations, and will be useless for any semantic purposes.
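To see why untrained vectors give such low similarity scores, here is a minimal sketch in plain Python. It does not use gensim; the `random_vector` helper and its value range are assumptions standing in for whatever random initialization a freshly built model uses. The point is only that random high-dimensional vectors are nearly orthogonal, so their cosine similarities stay close to zero.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def random_vector(dim, rng):
    # Assumption: small values centred on zero, standing in for
    # untrained, randomly initialized word vectors.
    return [(rng.random() - 0.5) / dim for _ in range(dim)]


rng = random.Random(0)  # fixed seed so the run is repeatable
dim = 300               # same dimensionality as in the question
vectors = [random_vector(dim, rng) for _ in range(50)]

# Cosine similarity for every distinct pair of random vectors.
similarities = [
    cosine(vectors[i], vectors[j])
    for i in range(len(vectors))
    for j in range(i + 1, len(vectors))
]

max_abs_similarity = max(abs(s) for s in similarities)
print(round(max_abs_similarity, 3))
```

If similarity scores in a real model look like this (small and with no structure), it is worth checking whether the vectors were ever actually trained, loaded or assigned.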

Suggestions:

  • Either (a) train your own vectors from your text, which makes sense if you have a good quantity of text; or (b) load vectors from elsewhere. But don't try to do both. (Or, in the case of the code above, neither.)
  • The update=True option for build_vocab() should be considered an expert, experimental option, only worth tinkering with if you've already had things working in simpler modes, and you're sure you need it and understand all the implications.
  • Normal use won't ever explicitly re-assign new values into the Word2Vec model's syn0 property. Those are managed by the class's training routines, so you never need to zero them out or modify them. You should tally up your own text summary vectors, based on the word-vectors, outside the model in your own data structures.
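The last suggestion can be sketched roughly as follows. This is an illustration under stated assumptions, not the asker's pipeline: `word_vectors` is a hypothetical toy dictionary standing in for the pre-trained model, and `people`, `sentence_vector` and `most_similar` are made-up names. Plain Python lists are used so the sketch runs without dependencies; with real gensim vectors you would more likely keep NumPy arrays.

```python
import math

# Hypothetical stand-in for a pre-trained model: word -> vector.
word_vectors = {
    "hiking": [0.9, 0.1, 0.0],
    "climbing": [0.8, 0.2, 0.1],
    "chess": [0.0, 0.9, 0.2],
    "poker": [0.1, 0.8, 0.3],
}

# Hypothetical cleaned, tokenized hobby descriptions keyed by person ID.
people = {
    "anna": ["hiking", "climbing"],
    "boris": ["climbing", "hiking", "yoga"],  # "yoga" has no vector
    "clara": ["chess", "poker"],
}


def sentence_vector(tokens, vectors):
    """Sum the vectors of known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(components) for components in zip(*known)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Keep the summary vectors in our own dictionary, outside any model.
person_vectors = {}
for person_id, tokens in people.items():
    vec = sentence_vector(tokens, word_vectors)
    if vec is not None:
        person_vectors[person_id] = vec


def most_similar(person_id, vectors_by_person):
    """Rank all other people by cosine similarity to person_id."""
    query = vectors_by_person[person_id]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors_by_person.items()
        if other != person_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


print(most_similar("anna", person_vectors))
```

Skipping unknown words explicitly, rather than wrapping the whole loop in a bare try/except, makes it visible when a sentence ends up with no vector at all.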
