First, the basics.
- Strip non-letters.
- Case fold.
- Convert to ASCII equivalents.
The first one is straight-forward, strip out everything which isn't a letter so São Paulo and São Paulo. are both SãoPaulo.
Case folding is also straight-forward, change everything to lower or upper case. são paulo and São Paulo compare the same.
Finally, convert them to the normal ASCII equivalents. For example, são becomes sao.
With this normalization done, the issues of spaces, extra characters, accents, and cases are taken are of. I would recommend doing this outside of BigQuery and in a language like Python. Do a select distinct and transform and compare each value using libraries such as unidecode.
You can then employ some heuristics to try and find "close enough" matches. One example is the Levenshtein distance which is the number of substitutions, insertions, and deletions one needs to do to turn one string into another. Python has a Levenshtein library.
For example, Sao Paul and Sao Paulo have a Levenshtein distance of one; add one letter. S Paulo and Sao Paulo have a Levenshtein distance of two, add two letters. Sao Paulo and Saint Paul have a Levenshtein distance of four; change o to i, add n and t, remove o.
Again, I'd recommend doing this with a regular programming language and then writing the normalized results back to BigQuery.