python - Sequence words with regex
I am searching for this sequence:
nunca[adv+neg+circ] más[adv+comp+circ] compraré[v+h_predicat_action]
and
nunca más compraré
my script:
import re

corpus = "me[unknown] temo[unknown] que[unknown] buscare[unknown] otras[unknown] opciones[unknown] esta[unknown] nunca[adv+neg+circ] más[adv+comp+padv+h_circonstant_quantite] compraré[v+h_predicat_action]"
part1 = re.findall(r"(\w+)\[adv\+neg.*?\]", corpus)
part2 = re.findall(r"(\w+)\[adv+comp+padv.*?\]", corpus)
part3 = re.findall(r"(\w+)\[v\+h_predicat.*?\]", corpus)
print(part1 + part2 + part3)
result:
[]
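One thing worth checking in the script above is the second pattern: its `+` signs are not escaped, so `v+` and `p+` are read as quantifiers rather than literal plus signs. The following is a minimal sketch of that difference, assuming Python 3 (where `\w` matches accented letters by default); the variable name `unescaped` is only illustrative.

```python
import re

corpus = ("me[unknown] temo[unknown] que[unknown] buscare[unknown] "
          "otras[unknown] opciones[unknown] esta[unknown] nunca[adv+neg+circ] "
          "más[adv+comp+padv+h_circonstant_quantite] compraré[v+h_predicat_action]")

# Unescaped: "v+" and "p+" act as quantifiers, so the literal "+" in the tag never matches.
unescaped = re.findall(r"(\w+)\[adv+comp+padv.*?\]", corpus)

# Escaped: "\+" matches a literal plus sign.
part1 = re.findall(r"(\w+)\[adv\+neg.*?\]", corpus)
part2 = re.findall(r"(\w+)\[adv\+comp\+padv.*?\]", corpus)
part3 = re.findall(r"(\w+)\[v\+h_predicat.*?\]", corpus)

print(unescaped)
print(part1 + part2 + part3)
```

With every `+` escaped, each pattern should pick up exactly one word, while the unescaped variant finds nothing.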
If the searched substrings can appear in arbitrary order, use the following re.findall() approach:
import re

corpus = "me[unknown] temo[unknown] que[unknown] buscare[unknown] \
otras[unknown] opciones[unknown] esta[unknown] nunca[adv+neg+circ] \
más[adv+comp+padv+h_circonstant_quantite] compraré[v+h_predicat_action]"

# findall() returns (word, key) tuples here; keep only the word from each
result = ' '.join(i[0] for i in re.findall(r'(\w+)\[[^][]*(ad|v)\+[^][]*\]', corpus, re.M | re.UNICODE))
print(result)
the output:
nunca más compraré
Regex pattern explanation:
(\w+) - matches a word (alphanumeric sequence), e.g. nunca; placed in the first captured group (...)
\[ - matches an opening square bracket [ literally
[^][]* - matches zero or more characters except square brackets ][
(ad|v) - alternation group, matches either the ad or the v key
\+ - matches a plus sign + literally
\] - matches a closing square bracket ] literally
For example, \[[^][]*(ad|v)\+[^][]*\] matches [adv+neg+circ]
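The pattern breakdown can be tried out token by token. This is a small sketch, assuming Python 3; it writes the same pattern with re.VERBOSE so each piece carries its own comment, and the names `TAGGED`, `hit` and `miss` are only illustrative.

```python
import re

# Same pattern as in the answer, spread over several lines with re.VERBOSE.
TAGGED = re.compile(r"""
    (\w+)      # group 1: the word itself
    \[         # literal opening square bracket
    [^][]*     # any run of characters that are not square brackets
    (ad|v)     # group 2: the key, either ad or v
    \+         # literal plus sign
    [^][]*     # remainder of the tag
    \]         # literal closing square bracket
""", re.VERBOSE)

hit = TAGGED.fullmatch("nunca[adv+neg+circ]")
miss = TAGGED.fullmatch("me[unknown]")

print(hit.groups() if hit else None)
print(miss)
```

Because the pattern contains two capture groups, re.findall() returns a tuple per match, which is why the answer keeps only the first element of each result. The [unknown] tag contains no plus sign, so it does not match.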
----------
If the order of the sequences is strict, use the re.sub() function instead of re.findall().
Remove the bracketed tag sequences:
import re

corpus = "me[unknown] temo[unknown] que[unknown] buscare[unknown] \
otras[unknown] opciones[unknown] esta[unknown] nunca[adv+neg+circ] \
más[adv+comp+padv+h_circonstant_quantite] compraré[v+h_predicat_action]"

# pass flags by keyword: the 4th positional argument of re.sub() is count
result = re.sub(r'\[[^][]+\]', '', corpus, flags=re.M | re.UNICODE)
print(result)
the output:
me temo que buscare otras opciones esta nunca más compraré
To extract the last 3 words:
print(' '.join(result.split()[-3:])) # nunca más compraré
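Taking the last three words only works when the sequence sits at the end of the corpus. As an alternative sketch (not part of the answer above, and assuming Python 3), a single pattern can require the three tagged words to appear consecutively and in order; the names `pattern` and `m` are illustrative.

```python
import re

corpus = ("me[unknown] temo[unknown] que[unknown] buscare[unknown] "
          "otras[unknown] opciones[unknown] esta[unknown] nunca[adv+neg+circ] "
          "más[adv+comp+padv+h_circonstant_quantite] compraré[v+h_predicat_action]")

# Three tagged words in a fixed order, separated by whitespace.
pattern = re.compile(
    r"(\w+)\[adv\+neg[^][]*\]\s+"      # word tagged adv+neg...
    r"(\w+)\[adv\+comp[^][]*\]\s+"     # followed by a word tagged adv+comp...
    r"(\w+)\[v\+h_predicat[^][]*\]"    # followed by a word tagged v+h_predicat...
)

m = pattern.search(corpus)
if m:
    print(' '.join(m.groups()))
```

This variant finds the sequence wherever it occurs in the text, and returns no match if the tags appear in a different order.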