
I was experimenting with extern and extern "C" for a bit, and accidentally had a typo in one of the identifiers - a $ had snuck in. When I compiled the code, got an undefined-symbol error, and eventually saw what caused it, I became curious whether it would actually compile. And guess what - Clang actually did compile it.

According to documentation I had read previously, the rules for identifiers were basically:

  • No double underscore at the beginning - because those are reserved.
  • No single underscore and upper case letter - reserved too.
  • Must start with a letter, a non-digit.
  • Must not exceed 31 characters.
  • May contain a-z, A-Z or 0-9 and _.

But this compiled just fine - and no warning was shown either:

void __this$is$a$mess() {}
int main() { __this$is$a$mess(); }

Looking at the resulting binary with nm:

Ingwie@Ingwies-Macbook-Pro.local /tmp $ clang y.c
Ingwie@Ingwies-Macbook-Pro.local /tmp $ nm a.out
0000000100000f90 T ___this$is$a$mess
0000000100000000 T __mh_execute_header
0000000100000fa0 T _main
                 U dyld_stub_binder

I can see the symbol name very clearly.

So why is it that Clang will let me do this, although by ANSI standards it should not? Even the GCC 6 I have installed did not warn or error about this.

Which compilers allow what kinds of identifiers - and why?

phuclv
Ingwie Phoenix

2 Answers


The rules in the 2018 C standard for identifiers include:

  • Per 6.4.2.1 paragraph 1, an identifier is a sequence of identifier-nondigit and digit characters, starting with an identifier-nondigit.
  • An identifier-nondigit is _, a to z, A to Z, a universal-character-name, or “other implementation-defined characters”.
  • A digit is 0 to 9.
  • A universal-character-name is \u followed by four hexadecimal digits or \U followed by eight hexadecimal digits, which specify Unicode characters.
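
To make that concrete, here is a minimal sketch of what that grammar admits - assuming a C99-or-later compiler whose identifier support includes universal-character-names (the variable names are made up for illustration):

/* Hypothetical names, purely to illustrate the grammar. */
int x1;            /* starts with a letter; a digit may follow */
int caf\u00E9;     /* \u00E9 is the universal-character-name for é (U+00E9) */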

So, if an implementation allows $, that is a valid character for that implementation. You may use it, but it may not be portable to other implementations. The C standard requires implementations to accept the specific characters listed, but it allows them to accept more. Generally, the C standard should be viewed as an open field rather than a walled garden: The behavior is defined within the field, but you are not stopped at the barrier; you may go beyond it, at your own risk.

The rules you were taught were rules for what is portable, not rules for what the C standard requires implementations to restrict you to.

The C standard defines strictly conforming code, which is, roughly speaking, code that should work in any C implementation, and conforming code, which is code that works in at least one C implementation. Conforming code is still C code. So the rules you were taught were for strictly conforming code.

Generally, you should prefer to write strictly conforming code and only use additional features when benefit (speed, ease of development on a particular platform, whatever) is worth the cost (loss of portability).
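
As a quick illustration (both variable names are made up), the first declaration below is strictly conforming, while the second is merely conforming - it compiles under Clang and GCC because of their implementation-defined acceptance of $, but another implementation may reject it:

int account_total;    /* strictly conforming: only guaranteed identifier characters */
int account$total;    /* conforming here, but relies on the $ extension */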

Eric Postpischil

According to documentation I had read previously, the rules for identifiers were basically:

  • No double underscore at the beginning - because those are reserved.
  • No single underscore and upper case letter - reserved too.

Such identifiers are indeed reserved, but that means that you must not declare or define them, not that they fail to be identifiers, or that they necessarily are not meaningful.

  • Must start with a letter, a non-digit.

Letters are indeed non-digits, but not all non-digits are letters. The _ character is a prime example.
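
For example, this one-liner is valid (placed at block scope, where leading-underscore names are not reserved; the function name demo is just for illustration):

void demo(void) { int _ok = 0; (void)_ok; }   /* _ starts the name: a non-digit, but not a letter */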

  • Must not exceed 31 characters.

This is not a formal limit of the language. C requires that implementations support at least 31 significant characters in external identifiers. Two external identifiers that differ only at the 32nd character or later are not guaranteed to be recognized as distinct, but they do not fail to be identifiers. Furthermore, implementations must recognize at least 63 significant characters in internal identifiers - which, again, may themselves be longer than that.

Some implementations recognize more significant characters, some even an unbounded number.
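
A sketch of the consequence, with made-up names; the collision described would arise only on an implementation that honors no more than the 31-character minimum for external names:

extern int this_prefix_is_exactly_31_charsA;
extern int this_prefix_is_exactly_31_charsB;
/* The names share their first 31 characters and differ only at character 32,
   so a minimum-limits implementation may treat them as the same external
   identifier - yet both are perfectly valid identifiers. */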

  • May contain a-z, A-Z or 0-9 and _.

Yes, but identifiers may also, explicitly, contain other implementation-defined characters. The $ character in particular is one that is fairly commonly allowed.

So why is it that Clang will let me do this, although by ANSI standards it should not? Even the GCC 6 I have installed did not warn or error about this.

The standard does not by any means say that identifiers containing the $ character are disallowed. It explicitly permits implementations to accept that character and substantially any other in identifiers, though there are some that cannot pragmatically be allowed because allowing them would introduce ambiguity. Programs that use identifiers containing such characters do not for that reason fail to conform, and implementations that accept them do not for that reason fail to conform. Such programs do fail to strictly conform, however, as that term is defined by the standard.
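
As far as I know, both GCC and Clang even expose this as a toggle, -fdollars-in-identifiers, together with its -fno- form. If so, turning the extension off should make the original y.c draw a diagnostic:

$ clang -fno-dollars-in-identifiers y.c

(Output omitted; the point is only that accepting $ is an implementation choice, not a standard requirement.)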

chux - Reinstate Monica
John Bollinger