Python 3 UnicodeDecodeError - How do I debug UnicodeDecodeError?


I have a text file that the publisher (the Securities and Exchange Commission) asserts is encoded in UTF-8 (https://www.sec.gov/files/aqfs.pdf, section 4). I'm processing the lines with the following code:

    import codecs

    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with codecs.open(filename, 'r', encoding='utf-8', errors='strict') as f:
            fields = f.readline().strip().split('\t')
            for line in f.readlines():
                yield process_tag_record(fields, line)

I receive the following error:

    Traceback (most recent call last):
      File "/home/randm/projects/finance/secxbrl.py", line 151, in <module>
        main()
      File "/home/randm/projects/finance/secxbrl.py", line 143, in main
        all_tags = list(tags("tag.txt"))
      File "/home/randm/projects/finance/secxbrl.py", line 109, in tags
        content = f.read()
      File "/home/randm/libraries/anaconda3/lib/python3.6/codecs.py", line 698, in read
        return self.reader.read(size)
      File "/home/randm/libraries/anaconda3/lib/python3.6/codecs.py", line 501, in read
        newchars, decodedbytes = self.decode(data, self.errors)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte

Given that I probably can't go back to the SEC and tell them they have files that don't seem to be encoded in UTF-8, how should I debug and catch this error?

What have I tried

I did a hexdump of the file and found that the offending text is "SUPPLEMENTAL DISCLOSURE OF NON�CASH INVESTING". If I read the offending byte as a hex code point (i.e. "U+00AD"), it makes sense in context, since that is the soft hyphen. But the following does not seem to work:

    Python 3.5.2 (default, Nov 17 2016, 17:05:23)
    [GCC 5.4.0 20160609] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> b"\x41".decode("utf-8")
    'A'
    >>> b"\xad".decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 0: invalid start byte
    >>> b"\xc2ad".decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte

I've used errors='replace', which seems to pass. But I'd like to understand what will happen if I try to insert that into a database.
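To make the effect of errors='replace' concrete, here is a minimal sketch on a sample byte string modelled on the offending text (the variable names are illustrative, not from the original script). Each undecodable byte becomes U+FFFD REPLACEMENT CHARACTER. That is an ordinary Unicode character, so a database column that accepts Unicode text would presumably store it like any other character, but the information about the original byte is lost.

```python
# Sample bytes modelled on the offending text; a lone 0xAD is not valid UTF-8.
raw = b"SUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING"

# errors='replace' substitutes U+FFFD for each undecodable byte.
text = raw.decode("utf-8", errors="replace")
print(ascii(text))

# U+FFFD re-encodes as three UTF-8 bytes, so the round trip
# no longer matches the original input.
round_trip = text.encode("utf-8")
print(round_trip == raw)
```

Whether a silent substitution like this is acceptable depends on how the text is used downstream.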

Edited to add the hexdump:

    0036ae40  31 09 09 09 09 53 55 50  50 4c 45 4d 45 4e 54 41  |1....SUPPLEMENTA|
    0036ae50  4c 20 44 49 53 43 4c 4f  53 55 52 45 20 4f 46 20  |L DISCLOSURE OF |
    0036ae60  4e 4f 4e ad 43 41 53 48  20 49 4e 56 45 53 54 49  |NON.CASH INVESTI|
    0036ae70  4e 47 20 41 4e 44 20 46  49 4e 41 4e 43 49 4e 47  |NG AND FINANCING|
    0036ae80  20 41 43 54 49 56 49 54  49 45 53 3a 09 0a 50 72  | ACTIVITIES:..Pr|

You have a corrupted data file. If that character really is meant to be a U+00AD SOFT HYPHEN, then you are missing a 0xC2 byte:

    >>> '\u00ad'.encode('utf8')
    b'\xc2\xad'

Of all the possible UTF-8 encodings that end in 0xAD, a soft hyphen makes the most sense. However, it is indicative of a data set that may have other bytes missing. You just happened to hit one that matters.

I'd go back to the source of this dataset and verify that the file was not corrupted when it was downloaded. Otherwise, using errors='replace' is a viable work-around, provided no delimiters (tabs, newlines, etc.) are missing.
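Before settling on a work-around, it can help to list every undecodable byte rather than only the first one the decoder trips over. The following is a minimal sketch that relies on the start, end and reason information carried by UnicodeDecodeError; the helper name find_bad_bytes and the sample bytes are hypothetical.

```python
def find_bad_bytes(data, encoding="utf-8", context=10):
    """Yield (offset, bad_bytes, surrounding_bytes) for each undecodable span."""
    pos = 0
    while pos < len(data):
        try:
            # Slicing copies the remainder; acceptable for a one-off diagnostic.
            data[pos:].decode(encoding)
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice that was decoded.
            start = pos + exc.start
            end = pos + exc.end
            yield start, data[start:end], data[max(0, start - context):end + context]
            pos = end  # resume scanning just after the bad span
        else:
            return  # the rest of the data decoded cleanly


sample = b"NON\xadCASH and Kranichh\xf6he"
hits = list(find_bad_bytes(sample))
for offset, bad, around in hits:
    print(offset, bad, around)
```

For a real file, the bytes would come from reading it in binary mode, for example open(filename, 'rb').read().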

Another possibility is that the SEC is really using a different encoding for the file; for example, in Windows codepage 1252 and Latin-1, 0xAD is the correct encoding of a soft hyphen. And indeed, when I download the same dataset directly (warning, large ZIP file linked) and open tag.txt, I can't decode the data as UTF-8:

    >>> open('/tmp/2017q1/tag.txt', encoding='utf8').read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../lib/python3.6/codecs.py", line 321, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte
    >>> from pprint import pprint
    >>> f = open('/tmp/2017q1/tag.txt', 'rb')
    >>> f.seek(3583550)
    3583550
    >>> pprint(f.read(100))
    (b'1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING AND FINANCING A'
     b'CTIVITIES:\t\nProceedsFromSaleOfIn')

There are two such non-ASCII characters in the file:

    >>> f.seek(0)
    0
    >>> pprint([l for l in f if any(b > 127 for b in l)])
    [b'SupplementalDisclosureOfNoncashInvestingAndFinancingActivitiesAbstract\t0'
     b'001654954-17-000551\t1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH I'
     b'NVESTING AND FINANCING ACTIVITIES:\t\n',
     b'HotelKranichhheMember\t0001558370-17-001446\t1\t0\tmember\tD\t\tHotel Krani'
     b'chhhe [Member]\tRepresents information pertaining to Hotel Kranichh\xf6h'
     b'e.\n']

Hotel Kranichh\xf6he decoded as Latin-1 is Hotel Kranichhöhe.
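As a quick check of the different-encoding theory, this small sketch decodes the two non-ASCII samples with both Latin-1 and Windows codepage 1252. The byte strings are shortened excerpts of the lines shown above, and the variable names are illustrative.

```python
samples = [b"NON\xadCASH", b"Hotel Kranichh\xf6he"]

decoded = []
for raw in samples:
    latin1 = raw.decode("latin-1")
    cp1252 = raw.decode("cp1252")
    # For these two bytes both codecs agree: 0xAD is U+00AD, 0xF6 is U+00F6.
    decoded.append((latin1, latin1 == cp1252))
    print(ascii(latin1), latin1 == cp1252)
```

The two codecs differ only in the 0x80 to 0x9F range, so for these particular bytes either one gives the same text.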

There are also several 0x1C / 0x1D pairs in the file:

    >>> f.seek(0)
    0
    >>> quotes = [l for l in f if any(b in {0x1C, 0x1D} for b in l)]
    >>> quotes[0].split(b'\t')[-1][50:130]
    b'Temporary Payroll Tax Cut Continuation Act of 2011 (\x1cTCCA\x1d) recognized during th'
    >>> quotes[1].split(b'\t')[-1][50:130]
    b'ributory defined benefit pension plan (the \x1cAetna Pension Plan\x1d) to allow certai'

I'm betting that those are really U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK characters; note the 1C and 1D parts. It almost feels as if their encoder took UTF-16 and stripped out all the high bytes, rather than encoding to UTF-8 properly!

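There is no codec shipping with Python that would encode '\u201C\u201D' to b'\x1C\x1D', making it all the more likely that the SEC has botched their encoding process somewhere. In fact, there are also 0x13 and 0x14 characters that are probably en and em dashes (U+2013 and U+2014), as well as 0x19 bytes that are almost certainly single quotes (U+2019). All that is missing to complete the picture is a 0x18 byte to represent U+2018.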
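The stripped-high-byte guess can be simulated in a few lines. This is only a sketch of the hypothesis, not a claim about how the SEC actually produced the file; strip_high_bytes is a hypothetical helper, and it only makes sense for characters in the Basic Multilingual Plane.

```python
def strip_high_bytes(text):
    """Keep only the low byte of each UTF-16 code unit (simulated breakage)."""
    # In little-endian UTF-16 the low byte comes first, so take every other byte.
    return text.encode("utf-16-le")[::2]


# Punctuation characters suspected in the answer, with the byte each would become.
suspects = ["\u2013", "\u2014", "\u2018", "\u2019", "\u201c", "\u201d"]
low_bytes = [strip_high_bytes(ch)[0] for ch in suspects]
print([hex(b) for b in low_bytes])

broken = strip_high_bytes("(\u201cTCCA\u201d) non\u00adcash")
print(broken)
```

Under this assumption, Latin-1 characters such as the soft hyphen survive as their single byte, which would match the 0xAD and 0xF6 bytes seen in the file.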

If we assume that the encoding is broken, we can attempt a repair. The following code reads the file and fixes the quote issues, assuming that the rest of the data does not use characters outside of Latin-1 apart from the quotes and dashes:

    _map = {
        # dashes
        0x13: '\u2013', 0x14: '\u2014',
        # single quotes
        0x18: '\u2018', 0x19: '\u2019',
        # double quotes
        0x1c: '\u201c', 0x1d: '\u201d',
    }

    def repair(line, _map=_map):
        """Repair mis-encoded SEC data. Assumes line was decoded as Latin-1."""
        return line.translate(_map)

Then apply that to the lines you read:

    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with open(filename, 'r', encoding='latin-1') as f:
            repaired = map(repair, f)
            fields = next(repaired).strip().split('\t')
            for line in repaired:
                yield process_tag_record(fields, line)
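To try the repair without the real download, here is a self-contained sketch that runs the same translation over a few in-memory bytes. The sample data, the QUOTE_MAP name and the column names are made up for illustration; io.BytesIO stands in for the file opened in Latin-1.

```python
import io

# Same mapping as in the answer: control bytes back to the intended punctuation.
QUOTE_MAP = {
    0x13: "\u2013", 0x14: "\u2014",  # dashes
    0x18: "\u2018", 0x19: "\u2019",  # single quotes
    0x1C: "\u201c", 0x1D: "\u201d",  # double quotes
}


def repair(line, _map=QUOTE_MAP):
    """Repair a line that was decoded as Latin-1."""
    return line.translate(_map)


# Hypothetical two-line sample: a header row and one mis-encoded record.
raw = b"tag\tdoc\nTCCA\tAct of 2011 (\x1cTCCA\x1d) \x13 non\xadcash\n"

with io.TextIOWrapper(io.BytesIO(raw), encoding="latin-1") as f:
    # Repair first, then strip the newline and split on tabs.
    rows = [repair(line).rstrip("\n").split("\t") for line in f]

print(rows)
```

The repair runs before any stripping or splitting, in the same order as in the answer's code.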

Separately, addressing your posted code: you are making Python work harder than it needs to. Don't use codecs.open(); that's legacy code that has known issues and is slower than the newer Python 3 I/O layer. Just use open(). Do not use f.readlines(); you don't need to read the whole file into a list here. Just iterate over the file directly:

    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with open(filename, 'r', encoding='utf-8', errors='strict') as f:
            fields = next(f).strip().split('\t')
            for line in f:
                yield process_tag_record(fields, line)

If process_tag_record also splits on tabs, use a csv.reader() object and avoid splitting each row manually:

    import csv

    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with open(filename, 'r', encoding='utf-8', errors='strict') as f:
            reader = csv.reader(f, delimiter='\t')
            fields = next(reader)
            for row in reader:
                yield process_tag_record(fields, row)

If process_tag_record combines the fields list with the values in row to form a dictionary, just use csv.DictReader() instead:

    import csv

    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with open(filename, 'r', encoding='utf-8', errors='strict') as f:
            reader = csv.DictReader(f, delimiter='\t')
            # The first row is used as the keys for the dictionary,
            # so there is no need to read fields manually.
            yield from reader
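For a feel of what csv.DictReader() yields, here is a small sketch on an in-memory tab-separated sample. The column names and values are hypothetical and only stand in for the real header of tag.txt.

```python
import csv
import io

# Hypothetical tab-separated sample: one header row and two records.
sample = "tag\tversion\tcustom\nCashFlow\tus-gaap/2016\t0\nHotelMember\t0001-17\t1\n"

# newline="" leaves line endings untouched, as the csv module recommends.
reader = csv.DictReader(io.StringIO(sample, newline=""), delimiter="\t")
rows = list(reader)

print(reader.fieldnames)
print(rows[0]["tag"], rows[1]["custom"])
```

Each row is a mapping keyed by the header names, and the values stay as strings unless you convert them yourself.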
