I wrote a function, findTokenOffset, that finds the character offset of each token in a text, where the tokens come from a pre-tokenization step (e.g. splitting on whitespace, or the output of some tokenizer).
import re, json

def word_regex_ascii(word):
    return r"\b{}\b".format(re.escape(word))

def findTokenOffset(text, tokens):
    seen = {}   # last offset at which each word has been seen
    items = []  # word tokens with their offsets
    my_regex = word_regex_ascii
    # for each token word
    for index_word, word in enumerate(tokens):
        r = re.compile(my_regex(word), flags=re.I | re.X | re.UNICODE)
        item = {}
        # for each match of the token in the sentence
        for m in r.finditer(text):
            token = m.group()
            characterOffsetBegin = m.start()
            characterOffsetEnd = characterOffsetBegin + len(m.group()) - 1  # offsets start from 0

            found = -1
            if word in seen:
                found = seen[word]

            if characterOffsetBegin > found:
                # remember where this word was last seen
                seen[word] = characterOffsetEnd
                item['index'] = index_word + 1  # word index starts from 1
                item['word'] = token
                item['characterOffsetBegin'] = characterOffsetBegin
                item['characterOffsetEnd'] = characterOffsetEnd
                items.append(item)
                break
    return items
This code works fine when the tokens are single words:
text = "George Washington came to Washington"
tokens = text.split()
offsets = findTokenOffset(text,tokens)
print(json.dumps(offsets, indent=2)) 
But suppose the tokens are multi-word, like here:
text = "George Washington came to Washington"
tokens = ["George Washington", "Washington"]
offsets = findTokenOffset(text,tokens)
print(json.dumps(offsets, indent=2)) 
the offsets come out wrong, because the same word repeats across different tokens:
[
  {
    "index": 1,
    "word": "George Washington",
    "characterOffsetBegin": 0,
    "characterOffsetEnd": 16
  },
  {
    "index": 2,
    "word": "Washington",
    "characterOffsetBegin": 7,
    "characterOffsetEnd": 16
  }
]
How can I add support for multi-word tokens and for overlapped regex matching (thanks to the comments for giving this problem its exact name)?
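For reference, here is one possible approach, sketched under two assumptions of mine (the names find_token_offsets and overlapping are my own, not from the code above): if the tokens appear in the text in the order they are listed, you can advance a search cursor past each match, so a word already consumed inside an earlier multi-word token is never matched again; and for truly overlapped matches of a single pattern, a zero-width lookahead finds every occurrence without consuming characters.

```python
import re

def find_token_offsets(text, tokens):
    """Sketch: match tokens in the order given, moving a cursor past each
    match so a word already consumed inside an earlier multi-word token
    is not matched again."""
    items = []
    pos = 0  # search cursor into text
    for index, token in enumerate(tokens, start=1):
        pattern = re.compile(r"\b{}\b".format(re.escape(token)), re.I | re.UNICODE)
        m = pattern.search(text, pos)  # start searching at the cursor
        if not m:
            continue  # token not found after the cursor; skip it
        items.append({
            "index": index,
            "word": m.group(),
            "characterOffsetBegin": m.start(),
            "characterOffsetEnd": m.end() - 1,  # inclusive, 0-based
        })
        pos = m.end()  # advance past this match
    return items

def overlapping(text, word):
    """Sketch: find every (possibly overlapping) occurrence of a word by
    wrapping the pattern in a zero-width lookahead, which matches at a
    position without consuming the characters after it."""
    pattern = re.compile(r"(?=(\b{}\b))".format(re.escape(word)))
    return [m.start(1) for m in pattern.finditer(text)]
```

With the example from the question, find_token_offsets places the second token "Washington" at offsets 26-35 instead of re-matching the occurrence at 7-16 inside "George Washington".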