beautifulsoup - How to retain &quot; and &apos; while parsing XML using bs4 in Python
I am using bs4 to parse an XML file and then write it out again as a new XML file.
Input file:
    <tag1>
      <tag2 attr1="a1">&quot; example text &quot;</tag2>
      <tag3>
        <tag4 attr2="a2">&quot; example text &quot;</tag4>
        <tag5>
          <tag6 attr3="a3">&apos; example text &apos;</tag6>
        </tag5>
      </tag3>
    </tag1>
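Script:
    from bs4 import BeautifulSoup

    with open("input.xml") as src:
        soup = BeautifulSoup(src, "xml")

    # encode() returns bytes, so the output file is opened in binary mode
    with open("output.xml", "wb") as f:
        f.write(soup.encode(formatter="minimal"))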
Output:
    <tag1>
      <tag2 attr1="a1">" example text "</tag2>
      <tag3>
        <tag4 attr2="a2">" example text "</tag4>
        <tag5>
          <tag6 attr3="a3">' example text '</tag6>
        </tag5>
      </tag3>
    </tag1>
I want to retain &quot; and &apos;. I tried the formatter options of encode (minimal, xml, html, None), but none of them solved the problem.
Then I tried replacing " with &quot; manually:
    import re

    for tag in soup.find_all(text=re.compile("\"")):
        res = tag.string
        res1 = res.replace("\"", "&quot;")
        tag.string.replace_with(res1)
But it gave the output below:
    <tag1>
      <tag2 attr1="a1">&amp;quot; example text &amp;quot;</tag2>
      <tag3>
        <tag4 attr2="a2">&amp;quot; example text &amp;quot;</tag4>
        <tag5>
          <tag6 attr3="a3">' example text '</tag6>
        </tag5>
      </tag3>
    </tag1>
It replaces & with &amp;. I am confused here. Please help me solve this.
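The behaviour described in the question can be reproduced with the standard library alone. The sketch below uses xml.etree.ElementTree as a stand-in for bs4 (an assumption made for illustration; it is not the asker's code). It shows that parsing decodes the entity into a plain quote character, and that a hand-inserted "&quot;" is stored as literal text, so its ampersand is escaped on output.

```python
import xml.etree.ElementTree as ET

# Parsing decodes the entity: the text node holds a plain quote character.
parsed = ET.fromstring('<tag2 attr1="a1">&quot; example text &quot;</tag2>')
parsed_text = parsed.text
print(repr(parsed_text))

# Writing the entity back by hand stores the literal characters "&quot;",
# so the serializer escapes the leading ampersand when it writes the tree.
parsed.text = parsed_text.replace('"', '&quot;')
serialized = ET.tostring(parsed, encoding="unicode")
print(serialized)
```

In other words, the entity has to be produced at serialization time, not stored in the tree.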
custom encode & output formatting
You can use a custom formatter function to add these specific entities to the entity substitution.
    from bs4 import BeautifulSoup
    from bs4.dammit import EntitySubstitution

    def custom_formatter(string):
        """Add &quot; and &apos; to entity substitution."""
        # Run the standard substitution first, then add the quote entities
        return (EntitySubstitution.substitute_html(string)
                .replace('"', '&quot;')
                .replace("'", '&apos;'))

    input_file = '''<tag1>
      <tag2 attr1="a1">&quot; example text &quot;</tag2>
      <tag3>
        <tag4 attr2="a2">&quot; example text &quot;</tag4>
        <tag5>
          <tag6 attr3="a3">&apos; example text &apos;</tag6>
        </tag5>
      </tag3>
    </tag1>
    '''

    soup = BeautifulSoup(input_file, "xml")
    # encode() returns bytes; decode them for printing
    print(soup.encode(formatter=custom_formatter).decode("utf-8"))
Output:
    <?xml version="1.0" encoding="utf-8"?>
    <tag1>
      <tag2 attr1="a1">&quot; example text &quot;</tag2>
      <tag3>
        <tag4 attr2="a2">&quot; example text &quot;</tag4>
        <tag5>
          <tag6 attr3="a3">&apos; example text &apos;</tag6>
        </tag5>
      </tag3>
    </tag1>
The trick is to do the replacement after EntitySubstitution.substitute_html(), so your &s don't get substituted with &amp;s.
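Why the order matters can be seen in a small standalone sketch. It uses xml.sax.saxutils.escape from the standard library in place of bs4's substitution step (an assumption for illustration), and the two helper names are hypothetical.

```python
from xml.sax.saxutils import escape

def escape_then_quote(text):
    # Escape &, < and > first, then add the quote entities.
    # The ampersands introduced in the second step are never re-escaped.
    escaped = escape(text)
    return escaped.replace('"', '&quot;').replace("'", '&apos;')

def quote_then_escape(text):
    # Wrong order: the ampersands of the new entities get escaped too.
    quoted = text.replace('"', '&quot;').replace("'", '&apos;')
    return escape(quoted)

good = escape_then_quote('" a & b "')
bad = quote_then_escape('" a & b "')
print(good)  # &quot; a &amp; b &quot;
print(bad)   # &amp;quot; a &amp; b &amp;quot;
```

The first function mirrors the order used in the custom formatter; the second reproduces the double escaping from the question.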