python - Sequence words with regex
I am searching for this sequence:
nunca[adv+neg+circ] más[adv+comp+circ] compraré[v+h_predicat_action]
and
nunca más compraré
My script:

    import re

    corpus = "me[unknown] temo[unknown] que[unknown] buscare[unknown] otras[unknown] opciones[unknown] esta[unknown] nunca[adv+neg+circ] más[adv+comp+padv+h_circonstant_quantite] compraré[v+h_predicat_action]"
    part1 = re.findall(r"(\w+)\[adv\+neg.*?\]", corpus)
    part2 = re.findall(r"(\w+)\[adv+comp+padv.*?\]", corpus)
    part3 = re.findall(r"(\w+)\[v\+h_predicat.*?\]", corpus)
    print(part1 + part2 + part3)

Result:
[]
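One likely contributor to the missing match is the second pattern, where the `+` signs are left unescaped, so `v+` is read as "one or more v" rather than a literal plus sign. The sketch below shows the difference on a shortened corpus; the `tag_prefix` variable and the use of `re.escape()` are illustrative assumptions, not part of the original script.

```python
import re

# Shortened corpus containing only the three tagged words of interest.
corpus = ("nunca[adv+neg+circ] "
          "más[adv+comp+padv+h_circonstant_quantite] "
          "compraré[v+h_predicat_action]")

# Unescaped: "v+" is a quantifier, so the literal "+" in the tag never matches.
unescaped = re.findall(r"(\w+)\[adv+comp+padv.*?\]", corpus)

# re.escape() backslash-escapes the "+" signs in the tag prefix.
tag_prefix = re.escape("adv+comp+padv")
escaped = re.findall(r"(\w+)\[" + tag_prefix + r".*?\]", corpus)

print(unescaped)  # []
print(escaped)    # ['más']
```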
If the searched substrings may appear in arbitrary order, use the following re.findall() approach:
    import re

    corpus = ("me[unknown] temo[unknown] que[unknown] buscare[unknown] "
              "otras[unknown] opciones[unknown] esta[unknown] nunca[adv+neg+circ] "
              "más[adv+comp+padv+h_circonstant_quantite] compraré[v+h_predicat_action]")
    # findall() returns (word, key) tuples here; keep only the word
    result = ' '.join(i[0] for i in re.findall(r'(\w+)\[[^][]*(ad|v)\+[^][]*\]', corpus, re.M | re.UNICODE))
    print(result)

The output:
nunca más compraré

Regex pattern explanation:

 - (\w+) - matches a word (alphanumeric sequence), e.g. nunca; it is placed in the first captured group (...)
 - \[ - matches the opening square bracket [ literally
 - [^][]* - matches zero or more characters except the square brackets ] and [
 - (ad|v) - alternation group, matches either the ad or the v key
 - \+ - matches the plus sign + literally
 - \] - matches the closing square bracket ] literally

For example, \[[^][]*(ad|v)\+[^][]*\] matches [adv+neg+circ].
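Because the pattern contains two capturing groups, re.findall() returns a list of tuples rather than plain strings, which is why the first element of each tuple is picked out. A minimal sketch of that behaviour, using a shortened corpus; the non-capturing variant `(?:ad|v)` is an alternative suggested here, not part of the original answer.

```python
import re

corpus = ("esta[unknown] nunca[adv+neg+circ] "
          "más[adv+comp+padv+h_circonstant_quantite] "
          "compraré[v+h_predicat_action]")

# Two capturing groups: each match is a (word, key) tuple.
matches = re.findall(r'(\w+)\[[^][]*(ad|v)\+[^][]*\]', corpus)
print(matches)  # [('nunca', 'v'), ('más', 'v'), ('compraré', 'v')]

# Keep only the word from each tuple.
words = [m[0] for m in matches]

# With a non-capturing group (?:...), findall() returns the words directly.
direct = re.findall(r'(\w+)\[[^][]*(?:ad|v)\+[^][]*\]', corpus)
print(direct)  # ['nunca', 'más', 'compraré']
```

Note that esta[unknown] is skipped because its tag contains no `+` sign.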
----------
If the order of the sequences is strict, use the re.sub() function instead of re.findall() to remove the bracketed tag sequences:
    import re

    corpus = ("me[unknown] temo[unknown] que[unknown] buscare[unknown] "
              "otras[unknown] opciones[unknown] esta[unknown] nunca[adv+neg+circ] "
              "más[adv+comp+padv+h_circonstant_quantite] compraré[v+h_predicat_action]")
    # flags must be passed by keyword: the 4th positional argument of re.sub() is count
    result = re.sub(r'\[[^][]+\]', '', corpus, flags=re.M | re.UNICODE)
    print(result)

The output:
me temo que buscare otras opciones esta nunca más compraré

To extract the last 3 words:
print(' '.join(result.split()[-3:])) # nunca más compraré
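Taking the last three words works only when the sequence sits at the end of the text. As an alternative sketch (a different technique from the re.sub() approach above), a single pattern can require the three tagged words to appear consecutively and in order. The `token()` helper and the group names `w1`, `w2`, `w3` are hypothetical, introduced only for illustration.

```python
import re

corpus = ("me[unknown] temo[unknown] que[unknown] buscare[unknown] "
          "otras[unknown] opciones[unknown] esta[unknown] nunca[adv+neg+circ] "
          "más[adv+comp+padv+h_circonstant_quantite] compraré[v+h_predicat_action]")

def token(tag_prefix, name):
    """Hypothetical helper: pattern for one word followed by [tag_prefix...]."""
    return r"(?P<%s>\w+)\[%s[^][]*\]" % (name, re.escape(tag_prefix))

# Three tagged tokens, separated by whitespace, in this exact order.
sequence = r"\s+".join([
    token("adv+neg", "w1"),
    token("adv+comp", "w2"),
    token("v+h_predicat", "w3"),
])

m = re.search(sequence, corpus)
if m:
    print(" ".join(m.group("w1", "w2", "w3")))  # nunca más compraré
```

If the tagged words appear in a different order or are not adjacent, re.search() returns None and nothing is printed.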