Python 3 UnicodeDecodeError - How do I debug UnicodeDecodeError?
I have a text file that the publisher (the Securities and Exchange Commission) asserts is encoded in UTF-8 (https://www.sec.gov/files/aqfs.pdf, section 4). I'm processing the lines with the following code:
    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with codecs.open(filename, 'r', encoding='utf-8', errors='strict') as f:
            fields = f.readline().strip().split('\t')
            for line in f.readlines():
                yield process_tag_record(fields, line)

I receive the following error:
    Traceback (most recent call last):
      File "/home/randm/projects/finance/secxbrl.py", line 151, in <module>
        main()
      File "/home/randm/projects/finance/secxbrl.py", line 143, in main
        all_tags = list(tags("tag.txt"))
      File "/home/randm/projects/finance/secxbrl.py", line 109, in tags
        content = f.read()
      File "/home/randm/libraries/anaconda3/lib/python3.6/codecs.py", line 698, in read
        return self.reader.read(size)
      File "/home/randm/libraries/anaconda3/lib/python3.6/codecs.py", line 501, in read
        newchars, decodedbytes = self.decode(data, self.errors)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte

Given that I probably can't go to the SEC and tell them they have files that don't seem to be encoded in UTF-8, how should I debug and catch this error?
What I have tried
I did a hexdump of the file and found that the offending text was "SUPPLEMENTAL DISCLOSURE OF NON�CASH INVESTING". If I decode the offending byte as a hex code point (i.e. "U+00AD"), it makes sense in context, as that is the soft hyphen. But the following does not seem to work:
    Python 3.5.2 (default, Nov 17 2016, 17:05:23)
    [GCC 5.4.0 20160609] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> b"\x41".decode("utf-8")
    'A'
    >>> b"\xad".decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 0: invalid start byte
    >>> b"\xc2ad".decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte

I've used errors='replace', which seems to pass. But I'd like to understand what will happen if I try to insert that into a database.
Edited to add hexdump:
    0036ae40  31 09 09 09 09 53 55 50 50 4c 45 4d 45 4e 54 41  |1....SUPPLEMENTA|
    0036ae50  4c 20 44 49 53 43 4c 4f 53 55 52 45 20 4f 46 20  |L DISCLOSURE OF |
    0036ae60  4e 4f 4e ad 43 41 53 48 20 49 4e 56 45 53 54 49  |NON.CASH INVESTI|
    0036ae70  4e 47 20 41 4e 44 20 46 49 4e 41 4e 43 49 4e 47  |NG AND FINANCING|
    0036ae80  20 41 43 54 49 56 49 54 49 45 53 3a 09 0a 50 72  | ACTIVITIES:..Pr|
You have a corrupted data file. If that character really is meant to be a U+00AD SOFT HYPHEN, then you are missing a 0xC2 byte:
    >>> '\u00ad'.encode('utf8')
    b'\xc2\xad'

Of all the possible UTF-8 encodings that end in 0xAD, a soft hyphen makes the most sense. However, it is indicative of a data set that may have other bytes missing. You just happened to hit one that matters.
I'd go back to the source of this dataset and verify that the file was not corrupted when downloaded. Otherwise, using errors='replace' is a viable work-around, provided no delimiters (tabs, newlines, etc.) are missing.
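To illustrate what errors='replace' does to the offending byte, here is a minimal sketch. The byte string is hand-made, modelled on the hexdump in the question, and is not read from the real file. Each undecodable byte becomes U+FFFD REPLACEMENT CHARACTER, which is an ordinary code point and so should be storable in any database column that accepts Unicode text; the original byte value, however, is lost.

```python
# Sample bytes modelled on the hexdump; an assumption, not the real file.
raw = b"SUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING\t\n"

replaced = raw.decode("utf-8", errors="replace")

# The invalid byte is now U+FFFD; the tab and newline delimiters are intact.
print(ascii(replaced))
print(replaced.count("\ufffd"))
```

Counting U+FFFD after decoding is a cheap way to see how much data was affected before deciding whether the work-around is acceptable.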
Another possibility is that the SEC is really using a different encoding for the file; for example, in Windows codepage 1252 and Latin-1, 0xAD is the correct encoding of a soft hyphen. And indeed, when I download the same dataset directly (warning, large ZIP file linked) and open tag.txt, I can't decode the data as UTF-8:
    >>> open('/tmp/2017q1/tag.txt', encoding='utf8').read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../lib/python3.6/codecs.py", line 321, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte
    >>> from pprint import pprint
    >>> f = open('/tmp/2017q1/tag.txt', 'rb')
    >>> f.seek(3583550)
    3583550
    >>> pprint(f.read(100))
    (b'1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING AND FINANCING A'
     b'CTIVITIES:\t\nProceedsFromSaleOfIn')

There are two such non-ASCII characters in the file:
    >>> f.seek(0)
    0
    >>> pprint([l for l in f if any(b > 127 for b in l)])
    [b'SupplementalDisclosureOfNoncashInvestingAndFinancingActivitiesAbstract\t0'
     b'001654954-17-000551\t1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH I'
     b'NVESTING AND FINANCING ACTIVITIES:\t\n',
     b'HotelKranichhheMember\t0001558370-17-001446\t1\t0\tmember\tD\t\tHotel Krani'
     b'chhhe [Member]\tRepresents information pertaining to Hotel Kranichh\xf6h'
     b'e.\n']

Hotel Kranichh\xf6he decoded as Latin-1 is Hotel Kranichhöhe.
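To answer the "how should I debug and catch this error" part more directly, the following is a minimal sketch of a scanner that decodes line by line and uses the `start` attribute of `UnicodeDecodeError` to report where each failure occurs. The helper name `find_invalid_utf8` and the sample bytes are hypothetical; only the first invalid byte on each line is reported.

```python
def find_invalid_utf8(data):
    """Yield (line_number, byte_offset, bad_byte) for each line that fails to decode.

    Only the first invalid byte per line is reported.
    """
    offset = 0
    # bytes.splitlines() only splits on \n and \r, so control bytes such as
    # 0x1c and 0x1d do not break lines here.
    for lineno, line in enumerate(data.splitlines(keepends=True), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the index of the first undecodable byte within `line`.
            yield lineno, offset + exc.start, line[exc.start]
        offset += len(line)


# Hypothetical sample data, not the real tag.txt.
sample = b"tag\tdoc\nA\tNON\xadCASH\nB\tKranichh\xf6he.\nC\tplain\n"
for lineno, pos, byte in find_invalid_utf8(sample):
    print(f"line {lineno}, offset {pos}: 0x{byte:02x}")
```

For a real file, the bytes could come from `open(filename, 'rb').read()`, or the same try/except could be applied while iterating over a file opened in binary mode.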
There are also several 0x1C / 0x1D pairs in the file:
    >>> f.seek(0)
    0
    >>> quotes = [l for l in f if any(b in {0x1c, 0x1d} for b in l)]
    >>> quotes[0].split(b'\t')[-1][50:130]
    b'Temporary Payroll Tax Cut Continuation Act of 2011 (\x1cTCCA\x1d) recognized during th'
    >>> quotes[1].split(b'\t')[-1][50:130]
    b'ributory defined benefit pension plan (the \x1cAetna Pension Plan\x1d) to allow certai'

I'm betting that those are really U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK characters; note the 1C and 1D parts. It almost feels as if their encoder took UTF-16 and stripped out all the high bytes, rather than encoding to UTF-8 properly!
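There is no codec shipping with Python that would encode '\u201C\u201D' to b'\x1C\x1D', making it all the more likely that the SEC has botched their encoding process somewhere. In fact, there are also 0x13 and 0x14 characters that are probably en and em dashes (U+2013 and U+2014), as well as 0x19 bytes that are almost certainly single quotes (U+2019). All that is missing to complete the picture is a 0x18 byte to represent U+2018.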
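The "stripped high bytes" hypothesis can be checked with a few lines. This is only a sketch of what a broken encoder might have done, not a statement about how the SEC actually produces the file: it encodes the suspected characters as UTF-16 and keeps just the low byte of each code unit.

```python
# Characters suspected of having been mangled.
suspects = "\u2013\u2014\u2018\u2019\u201c\u201d"

# Keep only the low byte of each UTF-16 code unit (little-endian puts it first).
# This models a guess about the faulty encoder, not a known fact.
low_bytes = bytes(ch.encode("utf-16-le")[0] for ch in suspects)

print(low_bytes)
```

The resulting bytes are 0x13, 0x14, 0x18, 0x19, 0x1C and 0x1D, which lines up with the control bytes observed in the file.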
If we assume that the encoding is broken, we can attempt to repair it. The following code would read the file and fix the quote issues, assuming that the rest of the data does not use characters outside of Latin-1 apart from the quotes:
    _map = {
        # dashes
        0x13: '\u2013', 0x14: '\u2014',
        # single quotes
        0x18: '\u2018', 0x19: '\u2019',
        # double quotes
        0x1c: '\u201c', 0x1d: '\u201d',
    }

    def repair(line, _map=_map):
        """Repair mis-encoded SEC data. Assumes line was decoded as Latin-1"""
        return line.translate(_map)

Then apply that to the lines you read:
    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with open(filename, 'r', encoding='latin-1') as f:
            repaired = map(repair, f)
            fields = next(repaired).strip().split('\t')
            for line in repaired:
                yield process_tag_record(fields, line)

Separately, addressing your posted code, you are making Python work harder than it needs to. Don't use codecs.open(); that's legacy code that has known issues and is slower than the newer Python 3 I/O layer. Just use open(). Do not use f.readlines(); you don't need to read the whole file into a list here. Just iterate over the file directly:
    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with open(filename, 'r', encoding='utf-8', errors='strict') as f:
            fields = next(f).strip().split('\t')
            for line in f:
                yield process_tag_record(fields, line)

If process_tag_record also splits on tabs, use a csv.reader() object and avoid splitting each row manually:
    import csv

    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with open(filename, 'r', encoding='utf-8', errors='strict') as f:
            reader = csv.reader(f, delimiter='\t')
            fields = next(reader)
            for row in reader:
                yield process_tag_record(fields, row)

If process_tag_record combines the fields list with the values in row to form a dictionary, just use csv.DictReader() instead:
    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with open(filename, 'r', encoding='utf-8', errors='strict') as f:
            reader = csv.DictReader(f, delimiter='\t')
            # The first row is used as the keys of each dictionary,
            # so there is no need to read fields manually.
            yield from reader
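Putting the pieces together, here is a self-contained sketch that combines the Latin-1 repair with csv.DictReader and runs on in-memory data. The helper name `read_rows` and the two-column sample are hypothetical and do not reflect the real tag.txt layout; the translation table repeats the one shown earlier so that the block runs on its own.

```python
import csv
import io

# Control bytes that appear to stand in for typographic punctuation.
_map = {
    0x13: '\u2013', 0x14: '\u2014',
    0x18: '\u2018', 0x19: '\u2019',
    0x1c: '\u201c', 0x1d: '\u201d',
}


def repair(line, _map=_map):
    """Repair a line that was decoded as Latin-1."""
    return line.translate(_map)


def read_rows(binary_file):
    """Yield one dict per record from a tab-separated binary stream."""
    # newline='' leaves line endings to the csv module, as its docs recommend.
    text = io.TextIOWrapper(binary_file, encoding='latin-1', newline='')
    reader = csv.DictReader(map(repair, text), delimiter='\t')
    yield from reader


# Hypothetical two-column sample, not the real tag.txt layout.
sample = b"tag\tdoc\nTCCA\tAct of 2011 (\x1cTCCA\x1d)\nHotel\tKranichh\xf6he.\n"
rows = list(read_rows(io.BytesIO(sample)))
print(rows)
```

With a real file, `open(filename, 'rb')` could be passed in place of the `io.BytesIO` object. If the data contains literal double-quote characters that are not meant as CSV quoting, passing `quoting=csv.QUOTE_NONE` to the reader may be needed.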