Been investigating an unexpected bloat in one of my Python programs and stumbled upon this fairly surprising result. As I was storing a lot of short strings, I decided to see how much space a one character long string ('x') takes. As it turns out, a whole friggin' lot! Below I'm providing the structure used to store a string object in CPython, with all the macros expanded. Every string object gets a separate copy of this structure. See unicodeobject.h and object.h for details.
The data below pertains to the 64-bit interpreter version, on the 32-bit version the total size should be smaller.
Total = sizeof(PyAscIIObject) + strlen('x') = 48 + 2 = 50. The same number can be obtained from within Python by using the standard sys.getsizeof() function:
50 bytes to store a single character string. Yikes!
P.S. One seemingly obvious optimization would be to lay objects in memory in a way that logically adjacent objects are close to each other in memory, so instead of using 8-byte pointers 2/4-byte offsets would suffice. (I'm a newbie though, so I have absolutely no clue how feasible that would be in CPython.)
The data below pertains to the 64-bit interpreter version, on the 32-bit version the total size should be smaller.
typedef struct {
/* _PyObject_HEAD_EXTRA */
/* me: All objects are linked in a list. */
struct _object *_ob_next;
struct _object *_ob_prev;
/* PyObject_HEAD */
/* me: Py_ssize_t is 4 bytes long. */
Py_ssize_t ob_refcnt
struct _typeobject *ob_type;
Py_ssize_t length; /* Number of code points in the string */
/* me: Py_hash_t is 4 bytes long. */
Py_hash_t hash; /* Hash value; -1 if not set */
struct {
unsigned int interned:2;
unsigned int kind:3;
unsigned int compact:1;
unsigned int ascii:1;
unsigned int ready:1;
unsigned int :24;
} state;
wchar_t *wstr; /* wchar_t representation (null-terminated) */
} PyASCIIObject;
Total = sizeof(PyAscIIObject) + strlen('x') = 48 + 2 = 50. The same number can be obtained from within Python by using the standard sys.getsizeof() function:
>>> sys.getsizeof('x')
50
50 bytes to store a single character string. Yikes!
P.S. One seemingly obvious optimization would be to lay objects in memory in a way that logically adjacent objects are close to each other in memory, so instead of using 8-byte pointers 2/4-byte offsets would suffice. (I'm a newbie though, so I have absolutely no clue how feasible that would be in CPython.)
No comments:
Post a Comment