A Unicode string consisting of only one character (with code point smaller than 256, i.e. from the Latin-1 range) is the most complicated case: such strings aren't really interned, but (similar to the small-integer pool, and identical to the behavior for bytes) they are created at interpreter start-up and stored in an array for as long as the interpreter is alive:
struct _Py_unicode_state {
    ...
    /* Single character Unicode strings in the Latin-1 range are being
       shared as well. */
    PyObject *latin1[256];
    ...
    /* This dictionary holds all interned unicode strings...
    */
    PyObject *interned;
    ...
};
So every time a unicode string of length 1 is created, the character is looked up in the latin1 array instead of a new object being built. E.g. in unicode_decode_utf8:
/* ASCII is equivalent to the first 128 ordinals in Unicode. */
if (size == 1 && (unsigned char)s[0] < 128) {
    if (consumed) {
        *consumed = 1;
    }
    return get_latin1_char((unsigned char)s[0]);
}
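The effect of this cache is easy to observe from pure Python. As a sketch (the `is` results are a CPython implementation detail, not a language guarantee): every length-1 string in the Latin-1 range resolves to the same cached object, no matter how it is produced:

```python
# All of these create a length-1 string in the Latin-1 range, so CPython
# hands back the same object from the latin1 array every time:
a = "z"            # literal
b = "xyz"[2]       # indexing a longer string
c = chr(122)       # from the code point
d = b"z".decode()  # decoding bytes (the unicode_decode_utf8 path above)

print(a is b, b is c, c is d)  # True True True in CPython
```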
One could even argue that if there were a way to circumvent this behavior in the interpreter, we would be speaking about a (performance) bug.
One possibility is to populate the unicode object's data ourselves via the C-API. I use Cython for the proof of concept, but ctypes could be used to the same effect:
%%cython
cdef extern from *:
    """
    PyObject* create_new_unicode(char *ch)
    {
        PyUnicodeObject *ob = (PyUnicodeObject *)PyUnicode_New(1, 127);
        Py_UCS1 *data = PyUnicode_1BYTE_DATA(ob);
        data[0] = ch[0]; // fill data without going through unicode_decode_utf8
        return (PyObject*)ob;
    }
    """
    object create_new_unicode(char *ch)

def gen1():
    return create_new_unicode(b"1")
Noteworthy details:

- PyUnicode_New does not look the character up in the latin1 array, because the characters aren't set yet.
- For simplicity, the above works only for ASCII characters - thus we pass 127 as maxchar to PyUnicode_New. As a result, we can access the data via PyUnicode_1BYTE_DATA, which makes it easy to manipulate manually without much ado.
And now:
a, b = gen1(), gen1()
a is b, a == b
# yields (False, True)
as wanted.
Here is a similar idea, but implemented with ctypes:
from ctypes import py_object, c_ssize_t, pythonapi

PyUnicode_New = pythonapi.PyUnicode_New
PyUnicode_New.argtypes = (c_ssize_t, c_ssize_t)
PyUnicode_New.restype = py_object

PyUnicode_CopyCharacters = pythonapi._PyUnicode_FastCopyCharacters
PyUnicode_CopyCharacters.argtypes = (py_object, c_ssize_t, py_object, c_ssize_t, c_ssize_t)
PyUnicode_CopyCharacters.restype = c_ssize_t

def clone(orig):
    cloned = PyUnicode_New(1, 127)
    PyUnicode_CopyCharacters(cloned, 0, orig, 0, 1)
    return cloned
Noteworthy details:

- It is not possible to use PyUnicode_1BYTE_DATA with ctypes, because it is a macro. An alternative would be to calculate the offset of the data member and access that memory directly, but that is platform-dependent and doesn't feel very portable.
- As a work-around, PyUnicode_CopyCharacters is used (there are probably other possibilities to achieve the same), which is more abstract and portable than directly calculating/accessing the memory.
- Actually, _PyUnicode_FastCopyCharacters is used, because PyUnicode_CopyCharacters would check that the target unicode has multiple references and throw an error. _PyUnicode_FastCopyCharacters doesn't perform those checks and does as asked.
And now:
a = "1"
b = clone(a)
a is b, a == b
# yields (False, True)
For strings longer than one character, it is a lot easier to avoid interning, e.g.:
a = "12"
b = "123"[0:2]
a is b, a == b
# yields (False, True)
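Other runtime constructions behave the same way. As a sketch (again a CPython implementation detail): string constants are cached at compile time, while a string built at runtime, e.g. with str.join, is a fresh object:

```python
a = "12"                 # compile-time constant, cached by the interpreter
b = "".join(["1", "2"])  # built at runtime, a fresh object

print(a is b, a == b)  # (False, True) in CPython
```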