I have a folder containing 10 docx files. I am trying to create a corpus, which should be a list of length 10. Each element of the list should hold the text of one docx document.
I have the following function to extract text from docx files:
            import os
            from nltk.corpus.reader.plaintext import PlaintextCorpusReader
            import glob 
            from docx import Document
            def getText(filename):
                document = Document(filename)
                newparatextlist = []
                for paragraph in document.paragraphs:
                    newparatextlist.append(paragraph.text.strip().encode("utf-8")) 
                return newparatextlist
            path = 'path_to_folder/*.docx'
            files=glob.glob(path)  
            corpus_list = []
            for f in files:
                cur_corpus = getText(f)
                corpus_list.append(cur_corpus)
            corpus_list[0] 
However, when my Word documents have content like the following:
http://www.actus-usa.com/sampleresume.doc
https://www.myinterfase.com/sjfc/resources/resource_view.aspx?resource_id=53
the above function creates a list of lists (one inner list of paragraph strings per document) instead of one text per document. How can I simply create a corpus out of the files?
TIA!
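
For reference, here is a minimal sketch of one possible way to get a single string per document, assuming python-docx is installed. It joins the paragraph texts of each file instead of returning them as a list. The helper names `join_paragraphs` and `build_corpus` are hypothetical, and using a newline as the separator is an assumption; a space would work just as well. The `.encode("utf-8")` call is dropped here so that each element stays a text string.

```python
import glob


def join_paragraphs(paragraph_texts):
    """Collapse one document's paragraph strings into a single string."""
    # Strip each paragraph and skip empty ones so blank lines do not pile up.
    cleaned = (text.strip() for text in paragraph_texts)
    return "\n".join(text for text in cleaned if text)


def get_text(filename):
    """Return the full text of one .docx file as a single string."""
    # Imported here so join_paragraphs stays usable without python-docx.
    from docx import Document

    document = Document(filename)
    return join_paragraphs(p.text for p in document.paragraphs)


def build_corpus(pattern):
    """Return a list with one string per file matching the glob pattern."""
    # sorted() keeps the order of the corpus reproducible between runs.
    return [get_text(f) for f in sorted(glob.glob(pattern))]
```

With this sketch, `corpus_list = build_corpus('path_to_folder/*.docx')` should give a flat list whose length equals the number of matching files, and `corpus_list[0]` would be the full text of the first file rather than a list of paragraphs.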