Suppose I have the sentence "Jane is running"
and a list of other sentences:
["Jane is a girl",
"Jane can run",
"Run a race",
"Sitting down on sofa",
"Sitting down on a chair",
"Sitting on a bench",
"Climbing a tree",
"Climbing a rock",
"Run to reach somewhere"]
My goal is to find which sentences in the list match the first sentence.
The output should look something like this:
"Jane is running" : "Jane can run", "Jane is a girl", "Run a race", "Run to reach somewhere"
Please note the order of the output: "Jane can run" comes first because it has two matches ("Jane" and "run"), while the other sentences match only one of "Jane" or "run".
The words of the main sentence could also have appeared as, for example, ran, run, running, Jan, Janet, or June, so spelling errors and variations of words need to be taken into account.
The algorithm that I came up with is:
- Divide the main sentence into a list of words: ["Jane", "is", "running"].
- Do the same for each sentence in the list.
- For every word in the main sentence, check for matches against every word of every sentence in the list, allowing an edit distance of 5 or 6.
- Group the sentences that match and sort them by the number of matching words, highest first.
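Concretely, the steps above can be sketched in Python roughly as follows. This is only a minimal sketch: `edit_distance` and `rank_matches` are names made up for illustration, `max_dist` stands in for the distance threshold, and lowercasing plus whitespace splitting is an assumed tokenization.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def rank_matches(query: str, candidates: list[str], max_dist: int) -> list[str]:
    """Return candidates sharing at least one close word with query, best first."""
    # Assumed tokenization: lowercase and split on whitespace.
    query_words = query.lower().split()
    scored = []
    for sentence in candidates:
        words = sentence.lower().split()
        # Count the query words that have at least one close word in this sentence.
        hits = sum(
            any(edit_distance(q, w) <= max_dist for w in words)
            for q in query_words
        )
        if hits:
            scored.append((hits, sentence))
    # Stable sort: more matching words first, ties keep their input order.
    scored.sort(key=lambda pair: -pair[0])
    return [sentence for _, sentence in scored]
```

Every word of the main sentence is compared with every word of every sentence in the list, which is the brute-force part. For scale, `edit_distance("running", "run")` is 4 and `edit_distance("jane", "june")` is 1.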
This feels like a very brute-force approach to the problem. How can I improve it?