Python: encoding of ' (quotation mark) in an XML file
I have an XML file in which no encoding information is specified. I am trying to read it and write it to another location using the method below:
    import xml.etree.ElementTree as ET
    import pandas as pd
    from lxml import etree, html
    from lxml.html.clean import Cleaner, clean_html
    from xml.sax.saxutils import escape, unescape, quoteattr

    with open('check1.xml', 'rb') as file:
        xml_file = file.read()
    tree = html.fromstring(xml_file)
    tree1 = etree.ElementTree(tree)
    tree1.write('path to xml file', pretty_print=True, xml_declaration=True, encoding='utf-8')
Input:
<unit> <source>site name: investigation's address</source> <target></target> </unit>
Output:
<unit> <source>site name: investigationâsaddress </source> <target/> </unit>
Why are these characters showing up, and why is the ' not displayed properly? I also tried latin-1 encoding and faced a similar issue, except that different characters were displayed in place of the '.
Don't use open() to read XML files. It's the wrong thing to do.
XML parsers have their own file handling, and ElementTree is no exception. Use ET.parse() to read files and the write() method of the resulting tree, i.e. tree.write(), to write them.
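    import xml.etree.ElementTree as ET

    # ET.parse() reads the file as bytes and handles the encoding itself
    tree = ET.parse('check1.xml')
    # pretty_print is an lxml-only option, so it is not passed to the standard library's write()
    tree.write('path to xml file', xml_declaration=True, encoding='utf-8')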
This simple parse-write cycle will also fix messed-up line endings: \r\n is not a proper line ending in XML, so the parser converts it to \n automatically.
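To illustrate the parse-write cycle, here is a minimal sketch. The sample content, the temporary directory, and the output file name check1_out.xml are assumptions made up for this example; only check1.xml comes from the question.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample: UTF-8 bytes, Windows (\r\n) line endings, no XML declaration.
raw = "<unit>\r\n<source>investigation\u2019s address</source>\r\n</unit>".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "check1.xml")
    dst = os.path.join(tmp, "check1_out.xml")  # hypothetical output name
    with open(src, "wb") as f:
        f.write(raw)

    # The parser opens the file itself and decodes the bytes (UTF-8 by default).
    tree = ET.parse(src)
    tree.write(dst, xml_declaration=True, encoding="utf-8")

    with open(dst, "rb") as f:
        written = f.read()

root = tree.getroot()
source_text = root.find("source").text   # the apostrophe survives intact
whitespace_after_unit = root.text        # the parser has turned "\r\n" into "\n"
```

The line-ending normalization happens while parsing, so it is visible in the parsed text regardless of how the file is written out afterwards.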
Background
In virtually all cases, the file handling functions in XML parsers deal with file encodings for you. Opening files yourself and reading them into strings bypasses this automatic handling, so doing it manually is a bug waiting to happen.
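As a sketch of how this kind of bug shows up, the snippet below decodes UTF-8 bytes with the wrong code page. It assumes the file stores a typographic apostrophe (U+2019) as UTF-8, which would be consistent with the "â" in the question's output but is not confirmed by it.

```python
# Assumption: the original text contains a typographic apostrophe (U+2019).
original = "investigation\u2019s address"
utf8_bytes = original.encode("utf-8")   # the apostrophe becomes 3 bytes: E2 80 99

# Decoding those bytes with a single-byte code page yields one character per byte.
as_cp1252 = utf8_bytes.decode("cp1252")    # apostrophe turns into "â€™"
as_latin1 = utf8_bytes.decode("latin-1")   # "â" plus two invisible control characters

print(repr(as_cp1252))
print(repr(as_latin1))
```

Different wrong encodings produce different garbage, which matches the observation that switching to latin-1 only changed which characters were displayed.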
If an XML file is missing the XML declaration (<?xml version="1.0" encoding="..." ?>), it is assumed to be UTF-8. If such a file isn't UTF-8 for some reason, it is, strictly speaking, broken.
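A small sketch of that default, using a made-up one-element document: bytes without a declaration parse fine when they are UTF-8 and are rejected when they are cp1252.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text containing a typographic apostrophe (U+2019).
text = "<source>investigation\u2019s address</source>"

# UTF-8 bytes without a declaration: the parser assumes UTF-8 and succeeds.
ok = ET.fromstring(text.encode("utf-8"))

# cp1252 bytes without a declaration: byte 0x92 is not valid UTF-8, so parsing fails.
try:
    ET.fromstring(text.encode("cp1252"))
    failed = False
except ET.ParseError:
    failed = True

print(ok.text, failed)
```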
XML-aware tools do not create such files. If you have such files, finding out how they were created and fixing that process should be your first priority.
If that's not an option, fixing such a broken file is the only situation where reading the file into a string and giving that string to the XML parser is the right solution. However, this requires prior knowledge of the file's encoding, something you don't need to bother with when using ET.parse().
Assuming your file is in Windows code page 1252, erroneously lacks the XML declaration, and you want to fix it by writing a properly encoded version:
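    import xml.etree.ElementTree as ET

    # Decode explicitly, because the file is known to be cp1252 but does not declare it
    with open('check1.xml', encoding='cp1252') as f:
        root = ET.fromstring(f.read())

    # fromstring() returns an Element; wrap it in an ElementTree to get write()
    tree = ET.ElementTree(root)
    tree.write('path to xml file', xml_declaration=True, encoding='utf-8')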
Unless you are in this specific situation, use ET.parse() to read XML files.