Beware of cPickle

The Python pickle module provides a way to serialize and deserialize Python objects. A major downside of the pickle format is that it is not secure: loading a pickle can execute arbitrary code, so you should not deserialize pickles received from untrusted sources.

There is also cPickle, a version of the pickle module that implements the same algorithm in C and is much faster than the pure Python module. That speed opens up some surprising uses beyond the obvious one of an application save format: it turns out cPickle can be the fastest way to copy nested structures. Its speed also makes cPickle attractive as a data format between trusted servers.
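To make the copying trick concrete, here is a minimal sketch of a pickle round trip used as a deep copy. The helper name pickle_copy and the sample data are made up for illustration, and the import fallback is an assumption so that the snippet also runs on Python 3, where cPickle no longer exists as a separate module.

```python
try:
    import cPickle as pickle  # Python 2: the C implementation
except ImportError:
    import pickle  # Python 3: the C accelerator is used automatically when available


def pickle_copy(obj):
    """Return a deep copy of obj by round-tripping it through pickle (hypothetical helper)."""
    # Protocol -1 selects the highest protocol the running interpreter supports.
    return pickle.loads(pickle.dumps(obj, -1))


# Sample nested structure, invented for this example.
nested = {"users": [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": []}]}
clone = pickle_copy(nested)
```

This only works for objects that can be pickled, and whether it actually beats copy.deepcopy depends on the data, so it is worth timing both on your own structures.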

There is an issue that you need to watch out for in the cPickle module, though. When you serialize to or deserialize from a string using the dumps and loads functions respectively, the functions do not release the GIL! This took me by surprise: I did not expect anything in the stdlib to hold on to the GIL during an operation that could potentially take a long time. You can try this out easily by creating a multithreaded application where one thread calls cPickle.dumps on a multi-megabyte data structure while the other threads do something visible, such as printing to the screen. You will see that while dumps is running, the other threads are stopped.
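The experiment described above could be sketched roughly as follows. Instead of printing, a background thread increments a counter, which makes the effect easier to measure. The function name count_ticks_during_dumps, the tick interval and the size of the test data are all assumptions chosen for illustration, and the result will vary with the Python version and the machine.

```python
import threading
import time

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3 fallback, so the sketch still runs


def count_ticks_during_dumps(data, interval=0.001):
    """Pickle data while a background thread counts how often it gets to run."""
    ticks = [0]
    stop = threading.Event()

    def ticker():
        # Wake up every `interval` seconds and record that we ran.
        while not stop.is_set():
            ticks[0] += 1
            time.sleep(interval)

    worker = threading.Thread(target=ticker)
    worker.start()
    time.sleep(0.01)  # let the ticker thread get going first

    before = ticks[0]
    started = time.time()
    payload = pickle.dumps(data, -1)
    elapsed = time.time() - started
    during = ticks[0] - before

    stop.set()
    worker.join()
    return during, elapsed, len(payload)


# Arbitrary test data of a few megabytes once pickled.
data = [list(range(100)) for _ in range(20000)]
during, elapsed, size = count_ticks_during_dumps(data)
print("ticks during dumps: %d in %.3fs (%d bytes)" % (during, elapsed, size))
```

If the ticker thread is being starved, the number of ticks will be far lower than elapsed divided by interval would suggest.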

Luckily there is an easy workaround: use the load and dump functions with a cStringIO buffer or another file-like object.
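A minimal sketch of that workaround follows, with dump and load wrapped so they behave like dumps and loads. The helper names dumps_via_buffer and loads_via_buffer are hypothetical, and the fallback to io.BytesIO is an assumption so the code also runs on Python 3. The sketch only shows the shape of the calls; it does not verify that any particular buffer type avoids the stall.

```python
import io

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

try:
    from cStringIO import StringIO as Buffer  # Python 2 in-memory buffer
except ImportError:
    Buffer = io.BytesIO  # Python 3 equivalent for binary data


def dumps_via_buffer(obj):
    """Serialize obj with dump() into an in-memory buffer and return the bytes."""
    buf = Buffer()
    pickle.dump(obj, buf, -1)  # -1 selects the highest available protocol
    return buf.getvalue()


def loads_via_buffer(data):
    """Deserialize an object with load() from an in-memory buffer holding data."""
    return pickle.load(Buffer(data))
```

For example, loads_via_buffer(dumps_via_buffer(obj)) should give back an equal copy of obj. To try a different file-like object, swap the Buffer class.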

Note that I haven’t checked if this problem applies to Python 3.x.

7 Comments

  1. blackdew:

    Err… It can’t drop the GIL while it’s (de)serializing, because it needs to access/create python objects. That’s kind’a the point of the GIL. If cPickle dropped the gil you would get either garbage or a segfault.

  2. Mikhail Edoshin:

    Right; a C function can only release GIL if it does something lengthy that doesn’t involve Python.

  3. Brandon Craig Rhodes:

    Your last statement leaves me deeply confused: how on earth could dumping to and from a StringIO object — which is yet another GIL-governed object that needs to be manipulated, just like a string does — obviate the pickling library’s need to use the GIL to protect the Python data structures from simultaneous change while it is traversing them?

  4. Heikki Toivonen:

    Brandon: Note I said cStringIO. By the time the cStringIO comes into play, we already have data serialized in memory. All we are doing is writing bytes into a buffer. So for this operation the GIL can even be released, since it is all happening in C.

    blackdew and Mikhail: The point is that it would not need to hold the GIL ALL THE TIME. It could easily release the GIL periodically. That is in fact a workaround I did until I realized that dump and load work around the problem. If you look at the source (python.org svn is not responding so can’t provide a link or check where exactly), you will see that the implementation does release the GIL around some operations.

  5. Jack Diederich:

    I’m surprised that there is a difference in timing between cPickle.dump and dumps. dumps/loads create a cStringIO object under the covers and then call the same routines as plain dump.

  6. Fai:

    I have encountered the same problem recently. I am running a multi-threaded Python server, and when one of the threads is doing cPickle.dumps() of 500 MB of data, all other threads halt because of the GIL.

    After trying different combinations, I found that the following will release and reacquire the GIL during the function call.

    1. cPickle.dump() with StringIO.StringIO();
    2. pickle.dump() with cStringIO.StringIO();
    3. pickle.dump() with StringIO.StringIO().

    cPickle.dump() with StringIO.StringIO() is the fastest.
    pickle.dump() with StringIO.StringIO() throws a MemoryError exception with a huge Python object.

  7. Fai:

    [I missed something in the above message]

    I have encountered the same problem recently. I am running a multi-threaded Python server, and when one of the threads is doing cPickle.dumps() of 500 MB of data, all other threads halt because of the GIL.

    After trying different combinations, I found that cPickle.dump() with cStringIO.StringIO() still holds the GIL during the whole function call. The following will release and reacquire the GIL during the function call.

    1. cPickle.dump() with StringIO.StringIO();
    2. pickle.dump() with cStringIO.StringIO();
    3. pickle.dump() with StringIO.StringIO().

    cPickle.dump() with StringIO.StringIO() is the fastest.
    pickle.dump() with StringIO.StringIO() throws a MemoryError exception with a huge Python object.