I am working on a large file (over 2 million rows) and would like to remove all titles and suffixes (personal and/or professional) from each string. As the small test case below shows, the titles and suffixes appear at different positions within each string.
I have used parts of the answers to the following three questions:
Negative lookahead on Regex Pattern
regular expression for exact match of a word
How to search for multiple strings and replace them with nothing within a list of strings
test <- c("pan-chr ii", "true ii.", "mr. and mrs panjii", "pans iv prof",
"md trs iv.", "iipan", "a c iii miss clark", "a c iv jones mrs",
"a c jones iv", "a c jr huffman phd.", "a c jr markkula",
"a c sr. goldtrap", "mr & mrs prof dr. a c cjdr iv, esq.",
"false mr petty phd", "abe jr esquibel phd",
"md reginald r dr esquire garcia", "laurence curry, md",
"lawrence mcdonald md phd", "mdonald mr and mrs sebelmd dr jr md phd",
 "(van) der walls")
# test
# [1] "pan-chr ii"                                   
# [2] "true ii."                                     
# [3] "mr. and mrs panjii"                           
# [4] "pans iv prof"                                 
# [5] "md trs iv."                                   
# [6] "iipan"                                        
# [7] "a c iii miss clark"                           
# [8] "a c iv jones mrs"                             
# [9] "a c jones iv"                                 
# [10] "a c jr huffman phd."                          
# [11] "a c jr markkula"                              
# [12] "a c sr. goldtrap"                             
# [13] "mr & mrs prof dr. a c cjdr iv, esq."          
# [14] "false mr petty phd"                           
# [15] "abe jr esquibel phd"                          
# [16] "md reginald r dr esquire garcia"              
# [17] "laurence curry, md"                           
# [18] "lawrence mcdonald md phd"                     
# [19] "mdonald mr and mrs sebelmd dr jr md phd"
# [20] "(van) der walls"
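# The pattern is assembled with paste0() so that the line breaks used for
# readability do not become part of the regular expression.
pattern <- paste0(
  ",? *(mister|sir|madam|mr\\.|mr|mrs\\.|mrs|ms\\.|",
  "mr\\. and mrs\\.|mr and mrs|mr\\. and mrs|mr and mrs\\.|",
  "mr\\. & mrs\\.|mr & mrs|mr\\. & mrs|mr & mrs\\.|& mrs\\.|and mrs\\.|",
  "and mrs\\.|& mrs|and mrs|ms|miss\\.|miss|prof\\.|prof|professor|",
  "doctor|md|md\\.|m\\.d\\.|dr\\.|dr|phd|phd\\.|esq\\.|esq|esquire|",
  "i{2,3}|i{2,3}\\.|iv|iv\\.|jr|jr\\.|sr|sr\\.|\\(|\\))(?![\\w\\d])"
)
# perl = TRUE is required for the negative lookahead (?!...)
testresult <- gsub(pattern, "", test, perl = TRUE)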
# testresult
# [1] "pan-chr"                    "true."                     
# [3] " panj"                      "pans"                      
# [5] " trs."                      "iipan"                     
# [7] "a c clark"                  "a c jones"                 
# [9] "a c jones"                  "a c huffman."              
# [11] "a c markkula"               "a c. goldtrap"             
# [13] " a c cj"                    "false petty"               
# [15] "abe esquibel"               " reginald r garcia"
# [17] "laurence curry"             "lawrence mcdonald"         
# [19] "mdonald sebel"              "(van der walls"
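One observation about the output above, offered as a partial diagnosis rather than a full fix: the pattern only guards the right edge of each token with `(?![\\w\\d])`, so a token can still match the tail of a longer word. That would explain why "panjii" loses its "ii" and "cjdr" loses its "dr". The sketch below uses a deliberately reduced, hypothetical token list (`ii|md|dr`, not the full pattern) to compare the right-edge guard alone with an added left-edge guard `(?<!\\w)`:

```r
x <- c("mr. and mrs panjii", "a c cjdr iv", "sebelmd dr")

# Right-edge guard only, as in the attempt above:
# tokens at the end of a longer word are still removed.
right_only <- gsub("(ii|md|dr)(?!\\w)", "", x, perl = TRUE)

# Adding a left-edge guard: a token must not be preceded by a word character.
both_sides <- gsub("(?<!\\w)(ii|md|dr)(?!\\w)", "", x, perl = TRUE)

print(right_only)
print(both_sides)
```

With both guards in place, "panjii", "cjdr" and "sebelmd" are left intact, while the standalone "dr" is still removed. Leftover spaces and stray punctuation would need a separate clean-up step.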
1) How should the regular expression used to produce testresult be revised to give the desired result shown below?
2) Is there a faster option than gsub, given that the file has more than 2 million rows?
Thank you.
# Desired testresult
# [1] "pan-chr"                       "true"                        
# [3] "panjii"                        "pans"                         
# [5] "trs"                           "iipan"                        
# [7] "a c clark"                     "a c jones"                    
# [9] "a c jones"                     "a c huffman"                 
# [11] "a c markkula"                 "a c goldtrap"                
# [13] "a c cjdr"                     "false petty"                  
# [15] "abe esquibel"                 "reginald r garcia"
# [17] "laurence curry"               "lawrence mcdonald"         
# [19] "mdonald sebelmd"              "van der walls"  
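On the second question, one possible direction, sketched under the assumption that many of the 2 million rows repeat the same string: run the replacement once per distinct value and map the results back. The helper names `clean_unique` and `strip_tokens` are hypothetical, and the token list is a reduced stand-in for the full pattern. Whether this is actually faster depends on how many duplicates the data contains.

```r
# Hypothetical helper: apply an expensive cleaning function once per distinct value
clean_unique <- function(x, cleaner) {
  u <- unique(x)          # distinct strings only
  cleaned <- cleaner(u)   # clean each distinct string once
  cleaned[match(x, u)]    # map the results back to the original positions
}

# Stand-in cleaner with a reduced token list (an assumption, not the full pattern)
strip_tokens <- function(s) {
  s <- gsub("(?<!\\w)(mr|mrs|md|phd|jr|iv)\\.?(?!\\w)", "", s, perl = TRUE)
  trimws(gsub(" +", " ", s))  # collapse runs of spaces, then trim the ends
}

x <- rep(c("a c jr markkula", "lawrence mcdonald md phd", "a c jones iv"), times = 4)
out <- clean_unique(x, strip_tokens)
print(out)
```

The regular expression work then scales with the number of distinct strings rather than the number of rows. The final indexing step is a plain vector lookup.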
 
     
    