I'm trying to parse an XML file using beautifulsoup4.
IDE: LiClipse
Python version: 2.7
XML encoding: UTF-8
Sample XML file: http://pastebin.com/RhjvyKDN
Below is the code I use to parse the XML files and write the extracted information to a local MySQL database.
from bs4 import BeautifulSoup
import pymysql
import os, os.path
#strips apostrophes and commas from the text, then wraps it in single quotes for the query
def apostro(text):
    text= text.replace("'","")
    text= text.replace(",","")
    text = "'"+text+"'"
    return text
#sets up the MYSQL connection
conn = pymysql.connect(host='127.0.0.1',  user='xxxx', passwd='xxxx', db='mysql', port= 3306 )
cur = conn.cursor()
#delete all previously loaded rows (those with a non-null title) from the db table
cur.execute("DELETE FROM db WHERE title is not null")
conn.commit()
#loop through all of the files
for root, _, files in os.walk("C:/usc/xml"):
    for f in files:
        #j is a counter for how many sections we have processed
        j=0
        #fullpath is the location of the file we're parsing
        fullpath = os.path.join(root, f)
        print(fullpath)
        #open file using BeautifulSoup
        soup = BeautifulSoup(open(fullpath), 'xml')
        sec = soup.find_all("section", {"style" : "-uslm-lc:I80"})
        t = soup.main.title
        t_num = t.num['value']
        #if not clauses are needed in case there is a blank, otherwise an error is thrown
        if not t.heading.text:
            t_head = ''
        else:
            t_head = t.heading.text.encode('ascii', 'ignore').encode("UTF-8")
        for element in sec:
            if not element.num['value']:
                section = ''
            else:
                section = element.num['value'].encode('ascii', 'ignore').encode("UTF-8")
            if not element.heading:
                s_head = ''
            else:
                s_head = element.heading.text.encode('ascii', 'ignore').encode("UTF-8")
            if not element.text:
                s_text = ''
            else:
                s_text = element.text.encode('ascii', 'ignore').encode("UTF-8")
            #inserttest is the sql command that 'cur' executes. counter is printed every time a section is written to let me know the program is still alive
            inserttest = "insert into deadlaws.usc_new (title, t_head, section, s_head, s_text) values (" + t_num + "," + apostro(t_head) + "," + apostro(section) + "," + apostro(s_head) + "," + apostro(s_text) +")"
            j=j+1
            cur.execute( inserttest)
            conn.commit()
            print(fullpath + " " +str(j))
conn.commit()
cur.close()
conn.close()
Everything went well until I noticed that the program drops the hyphens '-' from the section numbers, which makes the stored section numbers wrong.
I know I have used 'ignore' in the encode() calls, but a hyphen '-' is a legitimate ASCII character, right? Shouldn't it be written to the database instead of being ignored?
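One way to check whether the character really is an ASCII hyphen is to list the Unicode name of every non-ASCII character in the value. The sketch below assumes the section number contains an en dash (U+2013), which looks like a hyphen but is not ASCII. The sample value and the helper name non_ascii_chars are made up for illustration and are not taken from the XML file.

```python
import unicodedata

# Hypothetical section number; the \u2013 (en dash) is an assumption about
# what the XML really contains, not something confirmed from the file.
section = u"1395w\u20134"

def non_ascii_chars(text):
    """Return (code point, Unicode name) for each non-ASCII character."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
            for ch in text if ord(ch) > 127]

# 'ignore' silently drops anything outside ASCII, including the en dash.
stripped = section.encode("ascii", "ignore").decode("ascii")

print(non_ascii_chars(section))
print(stripped)
```

If the real data shows a name such as EN DASH here, then 'ignore' is behaving as documented: it removes a character that only looks like a hyphen.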
I did a lot of reading on SO and elsewhere.
I've tried passing from_encoding="utf-8" to the BeautifulSoup constructor, using 'xmlcharrefreplace' in the encode() calls, and other methods. These resulted in the output below: it writes a– (some special Unicode character) to the database instead of a hyphen '-'.
Sample output: 
The data is huge and I'm afraid there could be other characters like '-' that are being ignored by the program. It's OK if it ignores special characters in the t_head, s_head and s_text fields, as they are free text, but not in the section column.
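A possible approach for the section column, sketched under the assumption that the dropped characters are dash-like Unicode characters, is to map them to the ASCII hyphen before encoding and to report anything else that is still non-ASCII, so nothing gets lost silently. The names DASH_MAP and clean_section are hypothetical, and the list of code points is illustrative rather than exhaustive.

```python
# Dash-like code points mapped to the ASCII hyphen-minus. The list is an
# illustrative assumption; extend it after inspecting the real data.
DASH_MAP = {
    0x2010: u"-",  # HYPHEN
    0x2011: u"-",  # NON-BREAKING HYPHEN
    0x2012: u"-",  # FIGURE DASH
    0x2013: u"-",  # EN DASH
    0x2014: u"-",  # EM DASH
    0x2212: u"-",  # MINUS SIGN
}

def clean_section(text):
    """Normalize dashes, then report anything else that is still non-ASCII."""
    normalized = text.translate(DASH_MAP)
    leftovers = sorted(set(ch for ch in normalized if ord(ch) > 127))
    ascii_text = normalized.encode("ascii", "ignore").decode("ascii")
    return ascii_text, leftovers

# Hypothetical section value containing an en dash.
value, leftovers = clean_section(u"1395w\u20134")
print(value)
print(leftovers)
```

In the loop above, this would apply only to element.num['value']. A non-empty leftovers list would flag a section number that contains some other unexpected character.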
Any help in resolving this issue would be greatly appreciated.
 
     
    