We have a large amount of text (mostly in English) that was imported incorrectly from a source we have no control over. For example, we would like to split:
1. configuredincorrectly - into the 2 words configured & incorrectly
2. RegardsJohn Doe - into a word Regards and a named entity John Doe
3. To: person1@example.comCC:addr2@example.co.ukBCC:person3@example.sg - into 3 tuples (To, person1@example.com), (CC, addr2@example.co.uk), (BCC, person3@example.sg)
4. problem.Possible - into the 2 words problem & possible
I realize that we are trying to address several different problems here. It is tempting to write non-scalable code such as:
- regular expressions each time we try to solve a particular dirty text scenario,
- string.replace(keyword,keywordwithSpace)
Could anyone please point me towards a (partial) solution for problems 1 & 2?
A solution that makes use of natural language understanding would be ideal.
We have ~1000 words in our vocabulary, such as [communication, database, hardware, network, problem, rectify, solution, etc.]. Is there a way we can "train" a model to recognize that a string like hardwarefailure really means 2 separate words, hardware & failure?
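To make the hardwarefailure case concrete, here is a minimal sketch of the kind of vocabulary-driven splitting I have in mind. It is plain dictionary-based segmentation with dynamic programming, not a trained model, and it assumes every piece of the fused string is in the vocabulary. The names VOCAB and segment are made up for illustration, and the words failure, configured and incorrectly are added to the sample vocabulary only so that the examples work.

```python
# Hypothetical sample vocabulary; the real one has ~1000 words.
VOCAB = {
    "communication", "database", "hardware", "network", "problem",
    "rectify", "solution", "failure", "configured", "incorrectly",
}


def segment(text, vocab=VOCAB):
    """Split text into vocabulary words, preferring the fewest words.

    Returns a list of words, or None if text cannot be fully covered
    by vocabulary words.
    """
    text = text.lower()
    n = len(text)
    max_len = max(len(word) for word in vocab)

    # best[i] is the shortest known segmentation of text[:i], or None.
    best = [None] * (n + 1)
    best[0] = []

    for end in range(1, n + 1):
        # Only look back as far as the longest vocabulary word.
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate

    return best[n]


print(segment("hardwarefailure"))
print(segment("configuredincorrectly"))
```

This handles problem 1 only when both halves are known words. It does nothing for names such as John Doe in problem 2, which is why I am asking whether a trained or statistical approach would generalize better.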
Many thanks in advance!