Short story:
Is Python 3 unicode string lookup O(1) or O(n)?
Long story:
Index lookup of a character in a C char array is constant time O(1) because we can with certainty jump to a contiguous memory location:
const char* mystring = "abcdef";
char its_d = mystring[3];
Its the same as saying:
char its_d = *(mystring + 3);
Because we know that sizeof(char) is 1 as C99, and because of ASCII one character fits in one byte.
Now, in Python 3, now that string literals are unicode strings, we have the following:
>>> mystring = 'ab€cd'
>>> len(mystring)
5
>>> mybytes = mystring.encode('utf-8')
>>> len(mybytes)
7
>>> mybytes
b'ab\xe2\x82\xaccd'
>>> mystring[2]
'€'
>>> mybytes[2]
226
>> ord(mystring[2])
8364
Being UTF-8 encoded, byte 2 is > 127 and thus uses a multibyte representation for the character 3.
I cannot other than conclude that a index lookup in a Python string CANNOT be O(1), because of the multibyte representation of characters? That means that mystring[2] is O(n), and that somehow a on-the-fly interpretation of the memory array is being performed ir order to find the character at index? If that's the case, did I missed some relevant documentation stating this?
I made some very basic benchmark but I cannot infer an O(n) behaviour: https://gist.github.com/carlos-jenkins/e3084a07402ccc25dfd0038c9fe284b5
$ python3 lookups.py
Allocating memory...
Go!
String lookup: 0.513942 ms
Bytes lookup : 0.486462 ms
EDIT: Updated with better example.