What Unicode normalization (and other processing) is appropriate for passwords when hashing?

Question

If I accept full Unicode for passwords, how should I normalize the string before passing it to the hash function?

Goals

Without normalization, if someone sets their password to "mañana" (ma\u00F1ana) on one computer and tries to log in with "mañana" (ma\u006E\u0303ana) on another computer, the hashes will be different and the login will fail. Which of the two forms is sent is under the control of the user-agent or its operating system.

I'd like to ensure that those hash to the same thing.
I am not concerned about homoglyphs such as Α, А, and A (Greek, Cyrillic, Latin).
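
The mismatch described above can be reproduced with Python's standard unicodedata and hashlib modules. This is only a minimal sketch: SHA-256 is an assumption chosen to make the difference visible, not a recommendation for storing passwords, and the helper name digest is made up for illustration.

```python
import hashlib
import unicodedata

composed = "ma\u00F1ana"          # ñ as a single precomposed code point
decomposed = "ma\u006E\u0303ana"  # n followed by U+0303 COMBINING TILDE


def digest(password: str) -> str:
    # SHA-256 is used here only to show that the byte strings differ.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# Without normalization the two spellings hash differently.
raw_match = digest(composed) == digest(decomposed)

# After normalizing both to the same form (NFC here) they hash identically.
nfc_match = digest(unicodedata.normalize("NFC", composed)) == digest(
    unicodedata.normalize("NFC", decomposed)
)

print(raw_match, nfc_match)
```

Any of the four normalization forms would make these two inputs agree; which form to pick is the actual question.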

Reference

Unicode normalization forms: http://unicode.org/reports/tr15/#Norm_Forms

Considerations

Normalization may cause collisions, e.g. "oﬃce" == "office" under the compatibility forms (NFKC/NFKD).
Normalization can change the number of bytes in the string.
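
Both considerations can be checked with the standard unicodedata module. A small sketch, with variable names chosen only for illustration; note that the ligature collapses under the compatibility form but is left alone by NFC.

```python
import unicodedata

ligature = "o\uFB03ce"  # "oﬃce" written with U+FB03 LATIN SMALL LIGATURE FFI
plain = "office"

# The ligature is a compatibility character: NFC keeps it, NFKC expands it.
collides_under_nfc = unicodedata.normalize("NFC", ligature) == plain
collides_under_nfkc = unicodedata.normalize("NFKC", ligature) == plain

# Normalization can change the encoded length as well.
composed = "ma\u00F1ana"
decomposed = unicodedata.normalize("NFD", composed)
bytes_composed = len(composed.encode("utf-8"))
bytes_decomposed = len(decomposed.encode("utf-8"))

print(collides_under_nfc, collides_under_nfkc, bytes_composed, bytes_decomposed)
```

The length change matters mostly for hash functions or storage layers that truncate or limit their input.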

Further questions

What happens if the server receives a byte sequence that is not valid UTF-8 (or other format)? Reject, since it can't be normalized?
What happens if the server receives characters that are unassigned in its version of Unicode?

Are you primarily concerned with users using different input methods on different devices? Your example includes ligatures, but what about zero-width joiners and combiners? What about similar but semantically distinct code points like I (Latin Letter) vs Ⅰ (Roman Numeral) vs Ｉ (CJK Full-width)? — Mike Samuel, Apr 23 '13 at 17:35
I'm not concerned about homoglyphs -- it's unlikely they'll be able to type their entire password using an input method that only shares *some* (near-)glyphs -- but I'll have to think about joiners. It may be that preparing Unicode for password hashing needs a much more thorough approach. — Tim has moved to Codidact, Apr 24 '13 at 01:41

score 14 · Answer 1 · edited Nov 29 '19 at 08:15

Normalization is undefined for malformed input, such as alleged UTF-8 text that contains illegal byte sequences. Different environments may handle illegal bytes differently: by rejection, replacement, or omission.

Recommendation #1: If possible, reject inputs that do not conform to the expected encoding. (This may be out of the application's control, however.)
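
If the application does see the raw bytes, rejecting malformed input can be as simple as a strict decode. A minimal sketch, assuming the expected encoding is UTF-8; the function name decode_password is hypothetical.

```python
def decode_password(raw: bytes) -> str:
    """Decode password bytes as strict UTF-8; raise ValueError if malformed."""
    try:
        # errors="strict" refuses illegal sequences instead of replacing
        # or dropping them, so every environment sees the same outcome.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError("password is not valid UTF-8") from exc


print(decode_password("ma\u00F1ana".encode("utf-8")))
```

A Latin-1 encoded "mañana" (b"ma\xf1ana") fails here rather than being silently altered.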

Unicode Standard Annex #15 guarantees normalization stability when the input contains assigned characters only:

11.1 Stability of Normalized Forms

For all versions, even prior to Unicode 4.1, the following policy is followed:

A normalized string is guaranteed to be stable; that is, once normalized, a string is normalized according to all future versions of Unicode.

More precisely, if a string has been normalized according to a particular version of Unicode and contains only characters allocated in that version, it will qualify as normalized according to any future version of Unicode.

Recommendation #2: Whichever normalization form is used must use the Normalization Process for Stabilized Strings, i.e., reject any password inputs that contain unassigned characters, since their normalization is not guaranteed stable under server upgrades.
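
One way to approximate this check in Python is to look for code points whose general category is Cn (unassigned) before normalizing. This is a sketch under assumptions: the function names are made up, and "unassigned" is judged against the Unicode version bundled with the running interpreter (unicodedata.unidata_version), which may differ from the version on other servers.

```python
import unicodedata


def has_unassigned(password: str) -> bool:
    # General category "Cn" marks code points with no assigned character
    # in the Unicode database shipped with this Python build.
    return any(unicodedata.category(ch) == "Cn" for ch in password)


def normalize_stable(password: str, form: str = "NFKC") -> str:
    # Refuse input whose normalization could change after an upgrade.
    if has_unassigned(password):
        raise ValueError("password contains unassigned code points")
    return unicodedata.normalize(form, password)


print(normalize_stable("ma\u006E\u0303ana"))
```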

The compatibility normalization forms seem to handle Japanese better, collapsing several decompositions into the same output where the canonical forms do not.

The spec warns:

Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text.

However, semantics and round-tripping are not of concern here.

Recommendation #3: Apply NFKC or NFKD before hashing.
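
As a rough illustration of this recommendation, the sketch below normalizes to NFKC and then feeds the UTF-8 bytes to a key-derivation function. The choice of PBKDF2-HMAC-SHA256, the iteration count, and the name hash_password are assumptions made for the example, not part of the answer. The Japanese example uses halfwidth katakana, which NFC leaves distinct but NFKC folds together.

```python
import hashlib
import os
import unicodedata


def hash_password(password: str, salt: bytes, iterations: int = 600_000) -> bytes:
    # Normalize first so equivalent spellings produce identical bytes.
    prepared = unicodedata.normalize("NFKC", password)
    return hashlib.pbkdf2_hmac("sha256", prepared.encode("utf-8"), salt, iterations)


salt = os.urandom(16)
halfwidth = "\uFF76\uFF9E"  # HALFWIDTH KATAKANA LETTER KA + HALFWIDTH VOICED SOUND MARK
regular = "\u30AC"          # KATAKANA LETTER GA

# A low iteration count keeps the demonstration fast.
same_hash = hash_password(halfwidth, salt, 1000) == hash_password(regular, salt, 1000)
same_under_nfc = unicodedata.normalize("NFC", halfwidth) == regular

print(same_hash, same_under_nfc)
```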

Good answer. It would be better with references to RFC 4013 (http://www.ietf.org/rfc/rfc4013.txt) and its coming replacement, saslprepbis (http://tools.ietf.org/html/draft-ietf-precis-saslprepbis-07). — Joe Hildebrand, Oct 06 '14 at 23:56
Oh, good references! Would you like to edit the answer to include those? I'm afraid all my Unicode normalization knowledge is swapped out right now. :-( — Tim has moved to Codidact, Oct 07 '14 at 21:25
Those references have been superseded. RFC 7613 (https://tools.ietf.org/html/rfc7613) obsoletes RFC 4013, and the PRECIS framework (https://tools.ietf.org/html/rfc7564) is the end result of the saslprepbis process. — devstuff, Mar 18 '17 at 13:40
NFKD is the way to go. If you use NFKC and a new pre-composed character is added to Unicode, then the result will be different. — alextgordon, Sep 13 '18 at 18:24

score 2 · Answer 2 · answered Nov 19 '22 at 05:43

As of November 2022, the currently relevant authority from IETF is RFC 8265, “Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords,” October 2017. This document about usernames and passwords is a special case of the more-general PRECIS specification in the still-authoritative RFC 8264, “PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols,” October 2017.

RFC 8265, § 4.1:

This document specifies that a password is a string of Unicode code points [Unicode] that is conformant to the OpaqueString profile (specified below) of the PRECIS FreeformClass defined in Section 4.3 of [RFC8264] and expressed in a standard Unicode Encoding Form (such as UTF-8 [RFC3629]).

RFC 8265, § 4.2 defines the OpaqueString profile, the enforcement of which requires that the following rules be applied in the following order:

1. The string must be prepared to ensure that it consists only of Unicode code points explicitly allowed by the FreeformClass string class defined in RFC 8264, § 4.3. Certain characters are specified as:
   - Valid: traditional letters and numbers, all printable, non-space code points from the 7-bit ASCII range, space code points, symbol code points, punctuation code points, “[a]ny code point that is decomposed and recomposed into something other than itself under Unicode Normalization Form KC, i.e., the HasCompat (‘Q’) category defined under Section 9.17,” and “[l]etters and digits other than the ‘traditional’ letters and digits allowed in IDNs, i.e., the OtherLetterDigits (‘R’) category defined under Section 9.18.”
   - Invalid: Old Hangul Jamo code points, control code points, and ignorable code points. Further, any currently unassigned code points are considered invalid.
   - “Contextual Rule Required”: a number of code points from an “Exceptions” category and “joining code points.” (“Contextual Rule Required” means: “Some characteristics of the code point, such as its being invisible in certain contexts or problematic in others, require that it not be used in a string unless specific other code points or properties are present in the string.”)
2. Width Mapping Rule: Fullwidth and halfwidth code points MUST NOT be mapped to their decomposition mappings.
3. Additional Mapping Rule: Any instances of non-ASCII space MUST be mapped to SPACE (U+0020).
4. Unicode Normalization Form C (NFC) MUST be applied to all strings.
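
To make the order of the rules concrete, here is a deliberately simplified standard-library sketch. It is not a conforming OpaqueString implementation: the preparation step is reduced to rejecting control (Cc) and unassigned (Cn) code points, which is an assumption made for brevity, and the function name opaque_string_sketch is hypothetical. Treating "non-ASCII space" as general category Zs follows my reading of RFC 8265.

```python
import unicodedata


def opaque_string_sketch(password: str) -> str:
    # Preparation (heavily simplified): the real FreeformClass check also
    # covers Old Hangul Jamo, ignorable code points, and contextual rules.
    for ch in password:
        if unicodedata.category(ch) in ("Cc", "Cn"):
            raise ValueError(f"disallowed code point U+{ord(ch):04X}")

    # Width Mapping Rule: nothing to do, fullwidth and halfwidth forms are kept.

    # Additional Mapping Rule: every space separator (Zs) becomes U+0020.
    mapped = "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in password
    )

    # Normalization: NFC, applied last.
    return unicodedata.normalize("NFC", mapped)


print(opaque_string_sketch("ma\u006E\u0303ana pass\u00A0word"))
```

Because the profile uses NFC rather than NFKC, fullwidth "Ａ" and the "ﬃ" ligature stay distinct from their ASCII look-alikes.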

I can’t speak for any other programming language, but the Python package precis-i18n implements the PRECIS framework described in RFCs 8264, 8265, and 8266.

Here’s an example of how simple it is to enforce the OpaqueString profile on a password string:

# pip install precis-i18n
>>> import precis_i18n
>>> precis_i18n.get_profile('OpaqueString').enforce('å∆3⨁ucei=The4e-iy5am=3iemoo')
'å∆3⨁ucei=The4e-iy5am=3iemoo'
>>>

I found Paweł Krawczyk’s “PRECIS, the next step in Unicode validation” a very helpful introduction and source of Python examples.
