Whilst searching for a text classification method, I came across this Python code, which was used in the pre-processing step:
import re
from nltk.corpus import stopwords

REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))
def clean_text(text):
    """
        text: a string 
        return: modified initial string
    """
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text. substitute the matched string in REPLACE_BY_SPACE_RE with space.
    text = BAD_SYMBOLS_RE.sub('', text) # remove symbols which are in BAD_SYMBOLS_RE from text. substitute the matched string in BAD_SYMBOLS_RE with nothing. 
    text = text.replace('x', '')
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # remove stopwords from text
    return text
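For completeness, here is a self-contained version I could run directly. A tiny hand-rolled stopword set stands in for NLTK's (which requires a separate download), so the stopword filtering is only illustrative:

```python
import re

REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = {'the', 'a', 'is', 'in'}  # stand-in for stopwords.words('english')

def clean_text(text):
    text = text.lower()                       # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace listed symbols with a space
    text = BAD_SYMBOLS_RE.sub('', text)       # drop anything not in [0-9a-z #+_]
    text = text.replace('x', '')              # blanket removal of 'x'
    return ' '.join(w for w in text.split() if w not in STOPWORDS)

print(clean_text('The Next-Gen (API), in a box; ID=XXX-1234'))
# -> 'netgen api bo id1234'
```

Note how the final line already shows the side effects discussed below: 'next-gen' loses its 'x' and 'box' becomes 'bo'.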
I then tested this section of the code to understand the syntax and its purpose:
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
text = '[0a;m]'
BAD_SYMBOLS_RE.sub(' ', text)
# returns ' 0a m ' whilst I thought it would return '   ;  '
Question: why didn't the code replace 0, a, and m, although 0-9a-z was specified inside the [ ]? And why did it replace ;, even though that character wasn't specified?
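As a follow-up experiment, I compared the character class with and without the ^ on the same test string; the contrast makes the behaviour I observed clear:

```python
import re

negated = re.compile('[^0-9a-z #+_]')  # matches characters NOT in the set
plain = re.compile('[0-9a-z #+_]')     # matches characters IN the set

print(negated.sub(' ', '[0a;m]'))  # -> ' 0a m '  ('[', ';', ']' are matched)
print(plain.sub(' ', '[0a;m]'))    # -> '[  ; ]'  ('0', 'a', 'm' are matched)
```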
Edit (to avoid being marked as a duplicate):
My perceptions of the code are:
- The line BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')is confusing. Including the characters#,+, and_inside the[ ]made me think the line trying to remove the characters in the list (because no word in an English dictionary would contain those bad characters#+_, I believe?). Consequently, it made me interpret the^as the start of a string (instead of negation). Thus, the original post (which was kindly answered by Tim Pietzcker and Raymond Hettinger). The two linesREPLACE_BY_SPACE_REandBAD_SYMBOLS_REshould had been combined into one such as
REMOVE_PUNCT = re.compile('[^0-9a-z]')
text = REMOVE_PUNCT.sub('', text)
- I also think the line text = text.replace('x', '') (which was meant to remove the IDs that were masked as XXX-XXXX.... in the raw data) will lead to a bad outcome; for example, the word next will become net.
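A quick check confirms that the blanket 'x' removal mangles ordinary words along with the masked IDs:

```python
text = 'next, we mix the max with xxx-1234'
print(text.replace('x', ''))
# -> 'net, we mi the ma with -1234'
```

Note it also leaves the '-1234' fragment of the masked ID behind, since only the literal character 'x' is removed.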
Additional questions:
- Are my perceptions reasonable? 
- Should numbers/digits be removed from text? 
- Could you please recommend an overall/general strategy/code for text pre-processing for (English) text classification? 