python - Sentence matching with gensim word2vec: manually populated model doesn't work
I'm trying to solve a sentence comparison problem using the naive approach of summing word vectors and comparing the results. The goal is to match people by interest; the dataset consists of names and short sentences describing their hobbies. The batches are small, a few hundred people, so I wanted to give this a try before digging into doc2vec.
I prepare the data by cleaning it completely, removing stop words, tokenizing and lemmatizing. I use a pre-trained model for the word vectors, and it returns adequate results when finding similarities for test words. I also tried summing the words of a sentence and finding similarities in the original model - the matches make sense, and the similarities reflect the general sense of the phrase.
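That sanity check was roughly along these lines (a simplified sketch, not the exact code; the file name and example tokens are placeholders, and the pre-trained vectors are loaded as gensim KeyedVectors):

import numpy as np
import gensim as gs

# hypothetical file name; any word2vec-format vectors will do
a = gs.models.KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)

def sentence_vector(tokens, kv):
    """Sum the vectors of the in-vocabulary tokens (naive sentence embedding)."""
    vec = np.zeros(kv.vector_size, dtype=np.float32)
    for word in tokens:
        if word in kv:
            vec += kv[word]
    return vec

# the summed vector should land near words related to the phrase
vec = sentence_vector(["hiking", "mountains", "photography"], a)
print(a.similar_by_vector(vec, topn=5))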
For the sentence matching I'm trying the following: create an empty model
b = gs.models.Word2Vec(min_count=1, size=300, sample=0, hs=0)
Build the vocab out of the names (or person ids), with no training
#first create the vocab with one empty vector
test = [['test']]
b.build_vocab(test)
b.wv.syn0[b.wv.vocab['test'].index] = b.wv.syn0[b.wv.vocab['test'].index] * 0
#populate the vocab array
b.build_vocab([personids], update=True)
Sum each sentence's word vectors and store the result in the model under the corresponding id
#sentences are pulled from a pandas dataframe df. 'a' is the pre-trained model used to get vectors for each word
def summ(phrase, start_model):
    ''' vector addition function '''
    #starting vector of 0's
    sum_vec = start_model.word_vec("cat_noun") * 0
    for word in phrase:
        sum_vec += start_model.word_vec(word)
    return sum_vec

for i, row in df.iterrows():
    try:
        personid = row["id"]
        summvec = summ(df.iloc[i, 1], a)
        #updating syn0 for each name/id in the vocabulary
        b.wv.syn0[b.wv.vocab[personid].index] = summvec
    except:
        pass
I understand I shouldn't be expecting much accuracy here, but the t-SNE plot doesn't show any clustering whatsoever. Finding similarities with this method also fails to find matches (<0.2 similarity coefficient for everything).
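For reference, the kind of plot I mean can be produced roughly like this (a sketch, not my exact plotting code; it assumes scikit-learn and matplotlib, and the gensim 3.x syn0 attribute used above):

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# b.wv.syn0 holds one row per person id
X = np.asarray(b.wv.syn0)
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.title("t-SNE of per-person summed vectors")
plt.show()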
I'm wondering if anyone has an idea of where I went wrong? Is this approach valid at all?
Your code, as shown, neither train()s the word-vectors (using your local text), nor pre-loads vectors from elsewhere. The vectors that do exist – created by the build_vocab() calls – are still at their randomly-initialized starting locations, and so are useless for semantic purposes.
Suggestions:
- Either (a) train your own vectors from your text, which makes sense if you have a large quantity of text; or (b) load vectors from elsewhere. Don't try to do both. (Or, in the case of the code above, neither.) The first sketch after this list gives a rough illustration of the two options.
- The update=True option to build_vocab() should be considered an expert, experimental option – only worth tinkering with once you've had things working in simpler modes, and you're sure you need it and understand the implications.
- Normal use won't ever explicitly re-assign new values into a Word2Vec model's syn0 property – that's managed by the class's training routines, and you should never need to zero it out or modify it. Instead, you should tally your own per-text summary vectors, based on the word-vectors, outside the model in your own data structures – roughly as in the second sketch below.
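To make options (a) and (b) concrete, a minimal sketch (the corpus and file name are hypothetical; parameter names follow the gensim 3.x API the question uses, where gensim 4.x renames size to vector_size):

import gensim as gs

# Option (a): train your own vectors on your own tokenized text
corpus = [["hiking", "mountains", "photography"],
          ["cooking", "baking", "bread"]]  # hypothetical tokenized hobby sentences
own_model = gs.models.Word2Vec(corpus, size=100, min_count=1, workers=2)
word_vectors = own_model.wv

# Option (b): load vectors trained elsewhere (hypothetical file name)
# word_vectors = gs.models.KeyedVectors.load_word2vec_format(
#     "pretrained_vectors.bin", binary=True)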
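And for the last point, a sketch of keeping the summed vectors in plain Python/NumPy structures and comparing them there, rather than writing them into a Word2Vec model (it uses the word_vectors from the sketch above and the question's df; the 'hobby_tokens' column name is hypothetical):

import numpy as np

def sentence_vector(tokens, kv):
    """Sum the word vectors of the in-vocabulary tokens."""
    vec = np.zeros(kv.vector_size, dtype=np.float32)
    for word in tokens:
        if word in kv:
            vec += kv[word]
    return vec

# keep the per-person vectors in your own dict, outside any Word2Vec model
person_vecs = {row["id"]: sentence_vector(row["hobby_tokens"], word_vectors)
               for _, row in df.iterrows()}

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

def most_similar_people(pid, topn=5):
    """Rank the other person ids by cosine similarity to the given id."""
    query = person_vecs[pid]
    scores = [(other, cosine(query, vec))
              for other, vec in person_vecs.items() if other != pid]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:topn]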