If I accept full Unicode for passwords, how should I normalize the string before passing it to the hash function?
Goals
Without normalization, if someone sets their password to "mañana" (ma\u00F1ana) on one computer and tries to log in with "mañana" (ma\u006E\u0303ana) on another computer, the hashes will be different and the login will fail. This is under the control of the user-agent or its operating system.
- I'd like to ensure that those hash to the same thing.
- I am not concerned about homoglyphs such as Α, А, and A (Greek, Cyrillic, Latin).
Reference
Unicode normalization forms: http://unicode.org/reports/tr15/#Norm_Forms
Considerations
- Any normalization procedure may cause collisions, e.g.
"office" == "office". - Normalization can change the number of bytes in the string.
Further questions
- What happens if the server receives a byte sequence that is not valid UTF-8 (or other format)? Reject, since it can't be normalized?
- What happens if the server receives characters that are unassigned in its version of Unicode?