Python: encoding of ' (quotation mark) in an XML file


I have an XML file in which the encoding information is not specified. I am trying to read the file and write it to another location using the method below.

    import xml.etree.ElementTree as ET
    import pandas as pd
    from lxml import etree, html
    from lxml.html.clean import Cleaner, clean_html
    from xml.sax.saxutils import escape, unescape, quoteattr

    with open('check1.xml', 'rb') as file:
        xml_file = file.read()

    tree = html.fromstring(xml_file)
    tree1 = etree.ElementTree(tree)
    tree1.write('path xml file', pretty_print=True, xml_declaration=True, encoding='utf-8')

Input:

    <unit>
      <source>site name:  investigation's address</source>
         <target></target>
     </unit>

Output:

    <unit>&#13;
      <source>site name: investigationâsaddress </source>&#13;
         <target/>&#13;
     </unit>&#13;

Why are these characters showing up, and why is ' not displayed properly? I tried latin-1 encoding, but I face a similar issue, except that different characters are displayed in place of '.

Don't use open() to read XML files. That is the wrong thing to do.

XML parsers have their own file handling, and ElementTree is no exception. Use ET.parse() to read files and the write() method of the resulting tree, i.e. tree.write(), to write them.

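    import xml.etree.ElementTree as ET

    # Let the parser open the file so that it can work out the encoding itself.
    tree = ET.parse('check1.xml')
    # pretty_print=True is an lxml-only argument; the standard-library write() does not accept it.
    tree.write('path xml file', xml_declaration=True, encoding='utf-8')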

This simple parse-write cycle will also fix the messed-up line endings, since \r\n is not a proper line ending in XML; it is converted to \n automatically.
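The line-ending normalisation can be checked with the standard library alone. The snippet below is a minimal sketch: the sample document is made up for illustration and is parsed from bytes so that no file is needed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with Windows (\r\n) line endings, as raw bytes.
raw = b"<unit>\r\n  <source>site name</source>\r\n  <target></target>\r\n</unit>"

# The XML parser normalises \r\n to \n while parsing.
root = ET.fromstring(raw)
serialized = ET.tostring(root, encoding="utf-8")

print(repr(root.text))         # whitespace before <source>, without \r
print(b"\r" in serialized)     # False: no carriage returns survive
print(b"&#13;" in serialized)  # False: nothing left to escape as &#13;
```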


Background

In virtually all cases, the file handling functions in XML parsers deal with file encodings for you. Opening files yourself and reading them into strings breaks that automatic handling, i.e. doing it manually is a bug waiting to happen.
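To see why manual decoding is fragile, here is a small sketch of how a wrong guess garbles text. It assumes that the apostrophe in the file is the typographic ’ (U+2019) stored as UTF-8, which the question does not confirm; the variable names are made up for illustration. Decoded as cp1252, the three UTF-8 bytes of that character turn into â€™, and decoded as latin-1 they turn into â followed by two invisible control characters, which would match the "different characters" seen with latin-1.

```python
# Assumption: the file stores a typographic apostrophe (U+2019) as UTF-8.
original = "investigation\u2019s address"
utf8_bytes = original.encode("utf-8")

# Decoding those bytes with the wrong code page yields mojibake.
as_cp1252 = utf8_bytes.decode("cp1252")
as_latin1 = utf8_bytes.decode("latin-1")

# cp1252 maps the bytes E2 80 99 to U+00E2, U+20AC, U+2122.
print(as_cp1252 == "investigation\u00e2\u20ac\u2122s address")  # True
# latin-1 maps the same bytes to U+00E2 plus two C1 control characters.
print(as_latin1 == "investigation\u00e2\u0080\u0099s address")  # True
# Only the correct encoding restores the original text.
print(utf8_bytes.decode("utf-8") == original)                   # True
```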

If an XML file is missing the XML declaration (<?xml version="1.0" encoding="..." ?>), it is assumed to be UTF-8. If such a file isn't UTF-8 for whatever reason, it is, strictly speaking, broken.
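The UTF-8 default can be observed directly. In this sketch (the sample bytes are invented for illustration), a declaration-less document parses cleanly when its bytes are UTF-8, while the same character encoded as cp1252 makes the parser reject the input.

```python
import xml.etree.ElementTree as ET

# No XML declaration: the parser assumes UTF-8.
utf8_doc = b"<source>investigation\xe2\x80\x99s address</source>"
text = ET.fromstring(utf8_doc).text
print(text == "investigation\u2019s address")  # True

# Same character as a cp1252 byte (0x92): not valid UTF-8, so parsing fails.
cp1252_doc = b"<source>investigation\x92s address</source>"
try:
    ET.fromstring(cp1252_doc)
    parsed_ok = True
except ET.ParseError:
    parsed_ok = False
print(parsed_ok)  # False
```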

XML-aware tools do not create such files. If you have such files, checking how they were created and fixing that process should be your first priority.

If that's not an option, trying to fix such a broken file is the only situation where reading the file into a string and giving that string to the XML parser is the right solution. However, this requires prior knowledge of the file encoding, something you don't need to bother with when using ET.parse().

Assuming your file is in Windows code page 1252 and erroneously lacks the XML declaration, and you want to fix it by writing a properly encoded version:

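    import xml.etree.ElementTree as ET

    # Decode with the known (here: assumed) encoding, then hand the string to the parser.
    with open('check1.xml', encoding='cp1252') as f:
        root = ET.fromstring(f.read())

    # fromstring() returns an Element; wrap it in an ElementTree to get write().
    # pretty_print=True is lxml-only and is omitted here.
    ET.ElementTree(root).write('path xml file', xml_declaration=True, encoding='utf-8')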
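For a version that can be run end to end, the sketch below wraps the same idea in a helper and exercises it on a throwaway file. The function name `reencode_xml`, the sample content and the temporary paths are all made up for illustration; the source encoding still has to be known in advance.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def reencode_xml(src_path, dst_path, source_encoding):
    """Read a declaration-less XML file in a known encoding and rewrite it as UTF-8."""
    with open(src_path, encoding=source_encoding) as f:
        root = ET.fromstring(f.read())
    # Wrap the Element so that write() is available, and emit a declaration.
    ET.ElementTree(root).write(dst_path, xml_declaration=True, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "broken.xml")
    dst = os.path.join(tmp, "fixed.xml")
    # Simulate the broken input: cp1252 bytes, no XML declaration.
    with open(src, "wb") as f:
        f.write("<source>investigation\u2019s address</source>".encode("cp1252"))

    reencode_xml(src, dst, "cp1252")

    # The fixed file can now be read the normal way, with ET.parse().
    fixed_text = ET.parse(dst).getroot().text
    with open(dst, "rb") as f:
        fixed_bytes = f.read()

print(fixed_text == "investigation\u2019s address")  # True
print(fixed_bytes.startswith(b"<?xml"))               # True
```

Once the file has been rewritten with a declaration and a matching encoding, it no longer needs special treatment and can go through ET.parse() like any other XML file.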

Unless you are in this specific situation, use ET.parse() to read XML files.

