unicode - Extract all possible emoticons from a Python list
objective
I am trying to extract all possible Unicode emoticons from a word list. I am using Python 3 with an Anaconda installation, so I cannot use a package such as emoji.py.
Here is a sample bag of words from the word list.

    lst = ['✅','türkçe','Çile','ısp','İst','ğ','some','#','@','@one','#thing','','1','41','ç','ö','⏱','⏱','👏','₺','€',':)',':/']

The expected output is this:

    out = ['✅','⏱', '⏱','👏']

attempt 1
A list comprehension that checks whether the characters are ASCII:

    [w for w in lst if len(w) != len(w.encode())]

However, this does not give the desired output, because the text contains non-ASCII letters. Also, currency symbols are not emoticons.

    ['✅', 'türkçe', 'Çile', 'ısp', 'İst', 'ğ', 'ç', 'ö', '⏱', '⏱', '👏', '₺', '€']

attempt 2
Using the NLTK emoticon regular expression:

    from nltk.tokenize.casual import EMOTICON_RE
    EMOTICON_RE.findall(' '.join(lst))

However, EMOTICON_RE can only extract text expressions such as :) :/ :(

Here is the list of emoticons I am considering.
I tried to build a list of emoticons and check whether each word exists in that list, but I could not build a list of emoticons from their Unicode character codes.
Can you please suggest an approach?
I think all of these characters are in the Symbol, Other category (So). Therefore you can do:

    import unicodedata
    [w for w in lst if any(c for c in w if unicodedata.category(c) == 'So')]
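
To make the category check easier to try out, here is a minimal, self-contained sketch of the same idea. The helper names `has_other_symbol` and `extract_symbol_words` and the shortened `sample` list are illustrative assumptions, not part of the original post. It only relies on `unicodedata.category` from the standard library.

```python
import unicodedata


def has_other_symbol(word):
    """Return True if any character of `word` has general category 'So'."""
    return any(unicodedata.category(ch) == "So" for ch in word)


def extract_symbol_words(words):
    """Keep only the words that contain at least one 'So' character."""
    return [w for w in words if has_other_symbol(w)]


# Shortened version of the list from the question (hypothetical sample).
sample = ["✅", "türkçe", "Çile", "ğ", "some", "#", "@one", "",
          "41", "⏱", "👏", "₺", "€", ":)", ":/"]

print(extract_symbol_words(sample))  # ['✅', '⏱', '👏']

# Inspecting the categories shows why the filter separates these characters:
# the currency signs are 'Sc', the letter is 'Ll', and '#' is 'Po'.
for ch in "✅👏₺€ç#":
    print(repr(ch), unicodedata.category(ch))
```

Note that this is a heuristic rather than an exact emoji test: the So category also contains non-emoji symbols such as ©, so words containing those would be kept as well.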