
I'm writing a simple OCR checker in Excel that parses the OCR output text file into words and uses Wiktionary to check a selection of words to see if they are valid words.

I know there are sophisticated dictionary lookup systems that run in Python, but I'm trying to get this done without needing to get into Python. So I'm using Excel and Wiktionary for a simple approach.

I have a VBA function called vHttpRequest() that requests a URL and can return the HTTP status of the response. For example, if the word is "apple", then I run:

vHttpRequest("https://en.wiktionary.org/wiki/apple", , "status")

Which gives me status 200, indicating that "apple" is a valid word.

If the OCR omitted the space in "three apples", then I run:

vHttpRequest("https://en.wiktionary.org/wiki/threeapples", , "status")

Which returns 404, indicating that "threeapples" is not a valid word.

This is working very well. It correctly identifies most OCR errors. Two details of the process: Wiktionary's page lookup is case-sensitive and does not include possessives. So if I get a 404, I try again after converting the word to lower case and removing the last two characters if they are "'s" or "s'".
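
The retry step described above could be sketched roughly as follows. NormalizeForLookup is a hypothetical name, not part of the original code, and the handling of the curly apostrophe is an added assumption about what OCR output may contain.

```vba
Option Explicit

' Hypothetical helper: builds the fallback spelling to try after a 404.
Function NormalizeForLookup(ByVal word As String) As String
    Dim w As String

    ' Lower-case the word, since the page lookup is case-sensitive.
    w = LCase$(Trim$(word))

    ' Assumption: OCR output may contain a curly apostrophe (U+2019);
    ' map it to a straight one so the possessive test below matches.
    w = Replace(w, ChrW$(8217), "'")

    ' Drop the last two characters if they form a possessive ending.
    If Len(w) > 2 Then
        Select Case Right$(w, 2)
            Case "'s", "s'"
                w = Left$(w, Len(w) - 2)
        End Select
    End If

    NormalizeForLookup = w
End Function
```

The result would then be passed to the same status check, for example vHttpRequest("https://en.wiktionary.org/wiki/" & NormalizeForLookup(word), , "status").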

The problem is when I get a word that is valid in some other language. Wiktionary splits its pages up with anchors for each language in which the word exists. So, for example, if the word is "ther", that word is valid in three other languages, but not in modern English. So I would like to run something like:

vHttpRequest("https://en.wiktionary.org/wiki/ther#English", , "status")

To test whether Wiktionary's page for "ther" has a section for English. The problem is that the above call returns 200 because the page for "ther" exists. The status check ignores the anchor "#English" in the URL.

Is there a way to test whether that anchor exists on that page? Also open to suggestions of better solutions to the problem.

Giacomo1968
NewSites

2 Answers


Have you tried using their API? They recently added a REST API (link) that you can query for words. It returns formatted JSON, which you can easily parse.

An example of looking up the word "arbiter" (found in multiple languages) would look like this:

https://en.wiktionary.org/api/rest_v1/page/definition/arbiter

The result will have multiple entries, but as you can see, you only have to check whether the "en" block exists to know whether it's an English word.
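
A minimal VBA sketch of that check, under some assumptions: HasEnglishEntry is a hypothetical name, the request uses a late-bound MSXML2.XMLHTTP object rather than the asker's vHttpRequest(), and the language blocks are assumed to appear in the JSON as keys such as "en". Searching the response text for that key is a shortcut and not a real JSON parse.

```vba
Option Explicit

' Hypothetical helper: True if the definition response has an "en" block.
Function HasEnglishEntry(ByVal word As String) As Boolean
    Dim http As Object
    Dim url As String

    ' EncodeURL needs Excel 2013 or later on Windows.
    url = "https://en.wiktionary.org/api/rest_v1/page/definition/" & _
          Application.WorksheetFunction.EncodeURL(word)

    Set http = CreateObject("MSXML2.XMLHTTP.6.0")   ' late-bound request object
    http.Open "GET", url, False                      ' False = synchronous
    http.send

    ' Any status other than 200 leaves the result at its default, False.
    If http.Status <> 200 Then Exit Function

    ' Assumption: language blocks are keyed by code, e.g. "en":[ ... ].
    ' Quotes inside JSON string values are escaped, so they should not
    ' match this unescaped pattern.
    HasEnglishEntry = (InStr(1, http.responseText, """en"":", vbBinaryCompare) > 0)
End Function
```

With this, a worksheet formula such as =HasEnglishEntry("ther") would be expected to give FALSE if the page has no English section. A proper JSON parser would be more robust than the string search.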

Silbee

I accepted the answer from @Silbee because it provides a good solution -- better than the anchor detection I asked for -- to the problem of determining programmatically from Wiktionary if a word is valid in English.

However, I've also found an even better solution to the more general problem of determining programmatically if a word is valid in English.

I submitted the same question that I submitted here to ChatGPT. It didn't tell me about the API available for Wiktionary, but it did suggest some alternative solutions. Most of them required using Python, but one did not. That was to use the CheckSpelling method of MS Word. (A "method" is a function that runs on an object.) That can be run on an MS Word document directly in Word VBA, but it can also be run on a range of cells in Excel by calling it from a sub in Excel VBA. Nifty trick. The advantage of doing this instead of the Wiktionary lookup is that it runs locally on my computer, so I don't have the overhead of doing an HTTP request for every word. This could make a big difference when looking up more than a few words. (I'm running a local installation of Office 2021, not an online version.)
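
For anyone wanting to try it, here is a rough sketch of calling Word's CheckSpelling from Excel VBA. The sub name, the use of the current selection, and writing the result one column to the right are all illustrative choices, not the exact code I use. Adding a scratch document is a precaution, on the assumption that some Word versions want a document open before CheckSpelling will run.

```vba
Option Explicit

' Sketch: flag each selected cell TRUE/FALSE according to Word's speller.
Sub CheckSelectionWithWord()
    Dim wordApp As Object
    Dim cell As Range

    ' Late-bound, hidden Word instance; no reference to the Word library needed.
    Set wordApp = CreateObject("Word.Application")
    wordApp.Documents.Add   ' assumption: an open document may be required

    ' Assumes the current selection is a range of cells holding single words.
    For Each cell In Selection.Cells
        If Not IsError(cell.Value) Then
            If Len(cell.Value) > 0 Then
                ' Application.CheckSpelling returns True if the word is accepted.
                cell.Offset(0, 1).Value = wordApp.CheckSpelling(CStr(cell.Value))
            End If
        End If
    Next cell

    wordApp.Quit False      ' close Word without saving the scratch document
    Set wordApp = Nothing
End Sub
```

Starting Word once and reusing the instance for every cell matters here, since launching Word is far slower than a single spelling check.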

One thing to note if you're going to try this: There are at least five CheckSpelling methods in MS Word and Excel, three in Word and two in Excel. Three of the five are methods of a range or document, and those all open a proofing dialog box, so they are not suitable for programmatic use. The two that are suitable for programming are the method of the Word application that I linked to above and a method of the Excel application. At first glance, when working in Excel, it would seem better to use the Excel method, but ChatGPT told me it's less capable than the Word version.

ChatGPT detailed the differences between the two methods. I asked how it knew that since there's no mention of a difference in the MS docs. It said, “The detailed differences … are not typically documented in one specific source. Instead, they are inferred from various sources, documentation, and practical experience. Here are some steps you can take to find more information on this topic:” It then gave me a list of five types of sources, including official documentation, support forums, books, blogs, and articles. I think I can paraphrase that answer as “Good luck, you feeble human, trying to digest the breadth of information I used to be able to tell you about those differences.”

NewSites