I have a string that looks like:
phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'
I want to return a new string with certain words removed, only if they are not preceded by certain other words.
For example, the words I want to remove are:
c_out = ["avon", "powys", "somerset", "hampshire"]
Only if they do not immediately follow one of:
c_except = [r"on\s", r"dinas\s"]
Note: There could be multiple instances of words within c_out, and multiple instances of words within c_except.
First I tried handling 'on\s' on its own:
import re
phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'
regexp1 = re.compile(r'(?<!on\s)(avon|powys|somerset|hampshire)')
print("1st Result: ", regexp1.sub('', phrase))
1st Result:  '5  road bradford on avon avon dinas   north'
This correctly keeps the first 'avon', since it is preceded by 'on\s', and correctly removes the third 'avon', but it fails to remove the second 'avon'.
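One possible explanation for the second 'avon', offered as a quick check rather than a confirmed diagnosis: the look-behind only inspects the three characters before the match, and the tail of the first 'avon' plus the following space is also 'on '. The variable names below are illustrative only.

```python
import re

phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'

first = phrase.index('avon')
second = phrase.index('avon', first + 1)

# The three characters immediately before each of the first two occurrences.
before_first = phrase[first - 3:first]
before_second = phrase[second - 3:second]
print(repr(before_first), repr(before_second))

# Both contexts are 'on' plus whitespace, so (?<!on\s) rejects both positions.
blocked = [bool(re.fullmatch(r'on\s', ctx)) for ctx in (before_first, before_second)]
print(blocked)
```

If that is the cause, the look-behind cannot tell the word 'on' apart from a word that merely ends in 'on'.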
In the same way, for 'dinas\s':
phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'
regexp2 = re.compile(r'(?<!dinas\s)(avon|powys|somerset|hampshire)')
print("2nd Result: ", regexp2.sub('', phrase))
2nd Result:  '5  road bradford on   dinas powys  north '
This correctly keeps the first 'powys' and removes the second (note the double space in '... powys  north').
I tried to combine the two expressions by doing:
regexp3 = re.compile(r'((?!on\s)|(?!dinas\s))(avon|powys|somerset|hampshire)')
print("3rd Result: ", regexp3.sub('', phrase))
3rd Result:  5  road bradford on   dinas   north
This incorrectly removed every word in the list, ignoring both the 'on\s' and the 'dinas\s' exceptions.
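A likely reason, shown here as a small sketch on a shortened test string (an assumption, not the full phrase): (?!...) is a negative look-ahead, which inspects the text starting at the current position, whereas (?<!...) is a negative look-behind, which inspects the text just before it.

```python
import re

text = 'on avon'

# (?!on\s) looks forward from the current position. At the start of 'avon'
# the upcoming text is 'avon', not 'on ', so the assertion always passes.
lookahead = re.compile(r'(?!on\s)avon')
# (?<!on\s) looks backward at the text just before the current position.
lookbehind = re.compile(r'(?<!on\s)avon')

ahead_result = lookahead.sub('', text)
behind_result = lookbehind.sub('', text)
print(repr(ahead_result))
print(repr(behind_result))
```

With the look-ahead the word is removed even though it follows 'on '. With the look-behind it is kept.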
Then I tried:
regexp4 = re.compile(r'(?<!on\s|dinas\s)(avon|powys|somerset|hampshire)')
print("4th Result: ", regexp4.sub('', phrase))
And got:
error: look-behind requires fixed-width pattern
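For context, Python's re module requires every look-behind to match text of one fixed length. Here 'on\s' is three characters wide and 'dinas\s' is six, which would explain the error. A minimal sketch reproducing this ('by' is a made-up word used only to show an equal-width case):

```python
import re

# 'on\s' spans 3 characters and 'dinas\s' spans 6, so a single look-behind
# holding both alternatives has no one fixed width.
try:
    re.compile(r'(?<!on\s|dinas\s)avon')
    mixed_ok = True
except re.error as exc:
    mixed_ok = False
    print('re.error:', exc)

# Alternatives of equal width are accepted ('by' is a made-up example word).
same_width = re.compile(r'(?<!on\s|by\s)avon')
print(mixed_ok, same_width.search('by avon') is None)
```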
I want to end up with:
Result: '5  road bradford on avon dinas powys  north     '
I have had a look at:
Why is this not a fixed width pattern?
Python Regex Engine - "look-behind requires fixed-width pattern" Error
regex: string with optional parts
But to no avail.
What am I doing wrong?
Following a suggestion from the comments, I also tried chaining two look-behinds:
regexp5 = re.compile(r'(?<!on\s)(?<!dinas\s)(avon|powys|somerset|hampshire)')
print("5th Result: ", regexp5.sub('', phrase))
5th Result:  5  road bradford on avon avon dinas powys  north 
Again, this fails to remove the second 'avon'.
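One approach that might work, sketched under the assumption that the exception words are meant to match as whole words: put a \b word boundary inside each look-behind, so that 'on' no longer matches the tail of 'avon'. Since \b is zero-width, each look-behind stays fixed-width. The helper remove_words is hypothetical, and c_except is rewritten here as plain words with \b and \s added when the pattern is built.

```python
import re

phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'
c_out = ["avon", "powys", "somerset", "hampshire"]
c_except = ["on", "dinas"]  # assumption: plain words; \b and \s are added below

def remove_words(text, targets, exceptions):
    """Remove target words unless they directly follow an exception word."""
    # One fixed-width negative look-behind per exception word.
    guards = ''.join(r'(?<!\b{}\s)'.format(re.escape(w)) for w in exceptions)
    alternatives = '|'.join(re.escape(w) for w in targets)
    # \b on both sides keeps the targets from matching inside longer words.
    pattern = re.compile(guards + r'\b(?:' + alternatives + r')\b')
    return pattern.sub('', text)

result = remove_words(phrase, c_out, c_except)
print(repr(result))
print(result.split())
```

The removed words leave their surrounding spaces behind. If that matters, ' '.join(result.split()) collapses them.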
 
     
    