All,
I have some text that I need to clean up, and I have a small function that "mostly" works.
def removeStopwords(self, data):
    with open(r'stopwords.txt') as stopwords:
        wordList = []
        for i in stopwords:
            wordList.append(i.strip())
        charList = list(data)
        cat = ''.join(char for char in charList if not char in wordList).split()
        return ' '.join(cat)
Take the first line of this page, http://en.wikipedia.org/wiki/Paragraph, and remove all the characters we are not interested in, which in this case are all the non-alphanumeric characters.
A paragraph (from the Greek paragraphos, "to write beside" or "written beside") is a self-contained unit of a discourse in writing dealing with a particular point or idea. A paragraph consists of one or more sentences.[1][2] The start of a paragraph is indicated by beginning on a new line. Sometimes the first line is indented. At various times, the beginning of a paragraph has been indicated by the pilcrow: ¶.
The output looks pretty good, except that some of the words are joined together incorrectly, and I am unsure how to correct it.
A paragraph from the Greek paragraphos to write beside or written beside is a selfcontained unit
Note the word "selfcontained" was "self-contained".
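The joining happens because the unwanted character is simply dropped, so the pieces on either side of it end up adjacent. A minimal sketch of one possible workaround, assuming a hypothetical helper `replace_with_spaces` and a hypothetical `unwanted` character set (neither appears in the original code): substitute a space for each unwanted character, then let `split()` collapse the extra whitespace.

```python
def replace_with_spaces(data, unwanted):
    """Replace each unwanted character with a space, then normalize whitespace."""
    # Swapping in a space (instead of deleting) keeps "self-contained" as two words.
    spaced = ''.join(' ' if char in unwanted else char for char in data)
    # split() with no arguments collapses runs of whitespace.
    return ' '.join(spaced.split())


# Hypothetical character set for illustration; the real one would come from stopwords.txt.
unwanted = set('-(),."')
print(replace_with_spaces('a self-contained unit ("to write beside")', unwanted))
```

With this approach "self-contained" comes out as "self contained" rather than "selfcontained"; whether that is the desired result depends on how hyphenated words should be treated.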
EDIT: Contents of the stopwords file, which is just a bunch of characters:
! $ % ^ , & * ( ) { } [ ] <
, . / | \ ? ~ ` : ; "
It turns out I don't need a list of words at all, because I was only really trying to remove characters, which in this case were punctuation marks.
        # Needs "import string" at the top of the module; this is the Python 2 form of str.translate.
        cat = data.translate(None, string.punctuation).split()
        print ' '.join(cat).lower()
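The `translate(None, ...)` call and the `print` statement above are Python 2 only. A rough Python 3 equivalent might look like the sketch below, where `str.maketrans('', '', string.punctuation)` builds a table that deletes ASCII punctuation. The function name `strip_punctuation` is hypothetical, not from the original code.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCTUATION = str.maketrans('', '', string.punctuation)


def strip_punctuation(data):
    """Delete ASCII punctuation, collapse whitespace, and lowercase the result."""
    words = data.translate(_DELETE_PUNCTUATION).split()
    return ' '.join(words).lower()


print(strip_punctuation('A paragraph (from the Greek paragraphos, "to write beside")'))
```

Note that this still deletes hyphens outright, so "self-contained" becomes "selfcontained", and non-ASCII marks such as the pilcrow are left alone because they are not part of `string.punctuation`.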