I'm writing a simple OCR checker in Excel that parses the OCR output text file into words and uses Wiktionary to check a selection of words to see if they are valid words.
I know there are sophisticated dictionary lookup systems that run in Python, but I'm trying to get this done without needing to get into Python. So I'm using Excel and Wiktionary for a simple approach.
I have a VBA function called vHttpRequest() that requests a URL and can return the HTTP status code of the response. For example, if the word is "apple", then I run:
vHttpRequest("https://en.wiktionary.org/wiki/apple", , "status")
Which gives me status 200, indicating that "apple" is a valid word.
If the OCR omitted the space in "three apples", then I run:
vHttpRequest("https://en.wiktionary.org/wiki/threeapples", , "status")
Which returns 404, indicating that "threeapples" is not a valid word.
This is working very well and correctly identifies most OCR errors. Two details in the process: Wiktionary page titles are case sensitive, and Wiktionary does not have entries for possessives. So if I get a 404, I try again with the word converted to lower case and with the last two characters removed if they are "'s" or "s'".
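In outline, the retry logic is something like the sketch below. It is simplified for illustration: the helper names StripPossessive and WordExists are made up for this post, and it assumes vHttpRequest() returns a value that converts to the string "200" on success.

```vba
' Hypothetical helper: returns the word without a trailing 's or s'.
Function StripPossessive(ByVal word As String) As String
    Dim tail As String
    tail = LCase$(Right$(word, 2))
    If Len(word) > 2 And (tail = "'s" Or tail = "s'") Then
        StripPossessive = Left$(word, Len(word) - 2)
    Else
        StripPossessive = word
    End If
End Function

' Hypothetical helper: True if any fallback form of the word has a page.
' Assumes vHttpRequest(url, , "status") yields 200 (number or text) on success.
Function WordExists(ByVal word As String) As Boolean
    Dim candidates As Variant, c As Variant
    candidates = Array(word, LCase$(word), _
                       StripPossessive(word), LCase$(StripPossessive(word)))
    For Each c In candidates
        If CStr(vHttpRequest("https://en.wiktionary.org/wiki/" & c, , "status")) = "200" Then
            WordExists = True
            Exit Function
        End If
    Next c
    WordExists = False
End Function
```

With this, StripPossessive("apple's") gives "apple" and a word without a possessive ending comes back unchanged. The sketch does not skip duplicate forms, so a word that is already lower case is requested more than once.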
The problem arises when I get a word that is valid in some other language. Wiktionary divides each page into sections, with an anchor for each language in which the word exists. For example, the word "ther" is valid in three other languages, but not in modern English. So I would like to run something like:
vHttpRequest("https://en.wiktionary.org/wiki/ther#English", , "status")
To test whether Wiktionary's page for "ther" has a section for English. The problem is that the above call returns 200 because the page for "ther" exists. The status check ignores the anchor "#English" in the URL, presumably because the part after "#" is only used by the browser and is never sent to the server.
Is there a way to test whether that anchor exists on that page? Also open to suggestions of better solutions to the problem.