Monday 4 May 2015

How much memory will it take to store a single character string in Python 3?

Been investigating an unexpected bloat in one of my Python programs and stumbled upon this fairly surprising result. As I was storing a lot of short strings, I decided to see how much space a one character long string ('x') takes. As it turns out, a whole friggin' lot! Below I'm providing the structure used to store a string object in CPython, with all the macros expanded. Every string object gets a separate copy of this structure. See unicodeobject.h and object.h for details.

The data below pertains to the 64-bit interpreter version, on the 32-bit version the total size should be smaller.

typedef struct {
    /* _PyObject_HEAD_EXTRA */
    /* me: All objects are linked in a list. */
    struct _object *_ob_next;
    struct _object *_ob_prev;
    /* PyObject_HEAD */
    /* me: Py_ssize_t is 4 bytes long. */
    Py_ssize_t ob_refcnt
    struct _typeobject *ob_type;
    Py_ssize_t length;          /* Number of code points in the string */
    /* me: Py_hash_t is 4 bytes long. */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    struct {
        unsigned int interned:2;
        unsigned int kind:3;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
        unsigned int :24;
    } state;
    wchar_t *wstr;              /* wchar_t representation (null-terminated) */
} PyASCIIObject;

Total = sizeof(PyAscIIObject) + strlen('x') = 48 + 2 = 50. The same number can be obtained from within Python by using the standard sys.getsizeof() function:
>>> sys.getsizeof('x')
50

50 bytes to store a single character string. Yikes!

P.S. One seemingly obvious optimization would be to lay objects in memory in a way that logically adjacent objects are close to each other in memory, so instead of using 8-byte pointers 2/4-byte offsets would suffice. (I'm a newbie though, so I have absolutely no clue how feasible that would be in CPython.)