beautifulsoup - How to retain &quot; and &apos; while parsing XML using bs4 in Python


I am using bs4 to parse an XML file and then write it out again as a new XML file.

Input file:

    <tag1>
      <tag2 attr1="a1">&quot; example text &quot;</tag2>
      <tag3>
        <tag4 attr2="a2">&quot; example text &quot;</tag4>
        <tag5>
          <tag6 attr3="a3">&apos; example text &apos;</tag6>
        </tag5>
      </tag3>
    </tag1>

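Script:

    from bs4 import BeautifulSoup

    # Parse with the "xml" parser (needs lxml installed)
    with open("input.xml") as infile:
        soup = BeautifulSoup(infile, "xml")

    # On Python 3, encode() returns bytes, so write in binary mode
    with open("output.xml", "wb") as outfile:
        outfile.write(soup.encode(formatter='minimal'))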

Output:

    <tag1>
      <tag2 attr1="a1"> " example text "  </tag2>
      <tag3>
        <tag4 attr2="a2"> " example text " </tag4>
        <tag5>
          <tag6 attr3="a3"> ' example text ' </tag6>
        </tag5>
      </tag3>
    </tag1>

I want to retain &quot; and &apos;. I tried the different formatter options of encode (minimal, xml, html, None), but none of them solved the problem.

Then I tried replacing " with &quot; manually:

    import re

    # Swap each literal " for &quot; in the parsed text nodes
    for tag in soup.find_all(text=re.compile("\"")):
        res = tag.string
        res1 = res.replace("\"", "&quot;")
        tag.string.replace_with(res1)

But that gave the output below:

    <tag1>
      <tag2 attr1="a1"> &amp;quot; example text &amp;quot;  </tag2>
      <tag3>
        <tag4 attr2="a2"> &amp;quot; example text &amp;quot; </tag4>
        <tag5>
          <tag6 attr3="a3"> &apos; example text &apos; </tag6>
        </tag5>
      </tag3>
    </tag1>

It replaces & with &amp;. I am confused here. Please help me solve this.

Custom encode & output formatting

You can use a custom formatter function to add these specific entities to the entity substitution.

    from bs4 import BeautifulSoup
    from bs4.dammit import EntitySubstitution

    def custom_formatter(string):
        """Add &quot; and &apos; to entity substitution."""
        return EntitySubstitution.substitute_html(string).replace('"', '&quot;').replace("'", '&apos;')

    input_file = '''<tag1>
      <tag2 attr1="a1">&quot; example text &quot;</tag2>
      <tag3>
        <tag4 attr2="a2">&quot; example text &quot;</tag4>
        <tag5>
          <tag6 attr3="a3">&apos; example text &apos;</tag6>
        </tag5>
      </tag3>
    </tag1>
    '''

    soup = BeautifulSoup(input_file, "xml")

    # encode() returns bytes; decode them so print() shows plain text
    print(soup.encode(formatter=custom_formatter).decode("utf-8"))

    <?xml version="1.0" encoding="utf-8"?>
    <tag1>
    <tag2 attr1="a1">&quot; example text &quot;</tag2>
    <tag3>
    <tag4 attr2="a2">&quot; example text &quot;</tag4>
    <tag5>
    <tag6 attr3="a3">&apos; example text &apos;</tag6>
    </tag5>
    </tag3>
    </tag1>
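To finish the round trip described in the question, the formatted result still has to be written to a file. The following is a minimal sketch, assuming Python 3 with bs4 and lxml installed. The name keep_quote_entities is hypothetical, and the helper swaps substitute_html for substitute_xml on the assumption that HTML named entities such as &eacute; are not wanted in an XML file. Because encode() returns bytes, the output file is opened in binary mode.

```python
from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution


def keep_quote_entities(text):
    # Hypothetical helper: escape &, <, > first, then the two quote characters.
    escaped = EntitySubstitution.substitute_xml(text)
    return escaped.replace('"', "&quot;").replace("'", "&apos;")


xml_source = (
    '<tag1><tag2 attr1="a1">&quot; example text &quot;</tag2>'
    '<tag6 attr3="a3">&apos; example text &apos;</tag6></tag1>'
)
soup = BeautifulSoup(xml_source, "xml")

# encode() returns bytes, so the output file is opened with "wb".
result = soup.encode(formatter=keep_quote_entities)
with open("output.xml", "wb") as outfile:
    outfile.write(result)
```

If the input really is limited to characters that need no HTML entity, either substitution method should give the same file.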

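The trick in custom_formatter is that the quotes are replaced after EntitySubstitution.substitute_html() has run, so the & characters of the new &quot; and &apos; entities don't get substituted with &amp;.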
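The effect of the ordering can be seen without bs4 at all. The sketch below uses the standard library's xml.sax.saxutils.escape, which escapes &, < and > before applying any extra entities passed to it. The function names and the sample string are made up for illustration.

```python
from xml.sax.saxutils import escape

# Extra entities applied after &, <, > have already been escaped.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_then_quote(text):
    """Right order: escape &, <, > first, then turn quotes into entities."""
    return escape(text, QUOTE_ENTITIES)


def quote_then_escape(text):
    """Wrong order: the & of each new entity gets escaped a second time."""
    quoted = text.replace('"', "&quot;").replace("'", "&apos;")
    return escape(quoted)


sample = '" example & text "'
print(escape_then_quote(sample))  # &quot; example &amp; text &quot;
print(quote_then_escape(sample))  # &amp;quot; example &amp; text &amp;quot;
```

The second line of output is the same double escaping seen in the question, where the text nodes were edited before bs4 applied its own substitution on output. Since escape_then_quote takes a string and returns a string, it could presumably be passed as formatter= in the same way as custom_formatter.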

