Introduction

If you have ever worked with Erlang, you probably know that one of its core features is so-called hot code swapping, which allows you to update an application without restarting it. Of course, to be able to do this, the code has to be designed with this in mind, but since it’s deeply integrated into Erlang and its OTP framework, it is relatively easy to implement (in comparison with other languages).

You might ask why you should bother complicating an application by implementing this inside the app, when you could achieve something similar with an external balancer by switching traffic on the balancer. But that works only for certain applications, primarily those that need only short-lived connections, like HTTP, RPC or any other request-response protocol. It will not work for a protocol that requires long-lived connections, like WebSockets.

In this post I’ll demonstrate how you could achieve something like this in Python, and probably in other languages, since the approach is pretty generic. For simplicity, I will use a simple echo server as an example, but this could be done with any protocol (at least in theory).

Erlang example

Before we try to implement this in Python, let’s first check how it’s done in Erlang. Here’s a very simplistic TCP echo server that also maintains some state (since most real-world applications have some state):

-module(echo_server).
-export([start/0, loop/2]).
-define(LISTEN_PORT, 1234).

start() ->
    listen().

listen() ->
    {ok, LSock} = gen_tcp:listen(?LISTEN_PORT, [binary, {active, false},
                                                {reuseaddr, true}]),
    accept(LSock).

accept(LSocket) ->
    {ok, Socket} = gen_tcp:accept(LSocket),
    spawn(fun() -> echo_server:loop(Socket, 1) end),
    accept(LSocket).

loop(Socket, State) ->
    case gen_tcp:recv(Socket, 0) of
        {ok, Data} ->
            Sym = list_to_binary(io_lib:format("~p", [State])),
            Line = [<<"Line ">>, Sym, <<": ">>, Data],
            gen_tcp:send(Socket, Line),
            NewState = State + 1,
            echo_server:loop(Socket, NewState);
        {error, closed} ->
            ok
    end.

This application binds to TCP port 1234 and starts accepting new connections from clients by calling gen_tcp:accept. For each connection it spawns a new lightweight process, which calls the loop function recursively.

If you have Erlang installed on your machine, you can compile this example via erlc echo_server.erl, or by launching the erl shell and using the c(echo_server). command (do not forget the dot after the closing bracket, it’s important). If you don’t, you could use a Docker image that has it, for example: docker run -it erlang /bin/bash, which launches a bash shell with the necessary tools. Then you can launch the app from the erl shell with the echo_server:start(). command (again, the dot is important).

Now that it’s launched, let’s connect to the server and try to change it without breaking the client’s connection. For example, let’s replace the "Line" string with "New Line". I made a small asciicast of this:

There you can see that the application was updated without breaking the client’s connection. So how is it done? The Erlang VM that executes our code, known as BEAM (Bogdan’s/Björn’s Erlang Abstract Machine), has 2 different slots for code: a slot for the old code and a slot for the new one. When we do a hot code swap, it moves the current code into the old slot and loads the new version into the new slot. Moreover, old code can call new code, but new code can’t call old code. That’s exactly what happens here: the old loop function calls the new loop function, because the fully qualified call echo_server:loop(...) always resolves to the latest loaded version of the module. Since Erlang is a functional language with no loop keywords like for or while, every loop has to be implemented via recursion, so as long as the recursive call is fully qualified, eventually everything will be running the new version of the code.

Of course, our example is very simplistic. In many real-world cases a live update also requires a state update (since the new version of the code may represent its internal state in a different way), and a real-world app would usually use OTP behaviours like gen_server. Such behaviours require you to provide a code_change function, which converts the internal process state into the new format (or the old one, if it’s a downgrade).

Python echo server

Now let’s implement a similar Python echo server, without live update first. I’ll do this almost without any libraries, since some Python libraries maintain internal state of their own, and avoiding them makes the example much simpler to understand. So I’m going to use plain old select, since it is also supported on Windows. You can find my implementation here.
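To give a rough idea of what a select-based echo server with per-connection state looks like, here is a minimal sketch. It is not the linked implementation: the EchoServer class, its counters dictionary and the step method are names made up for this illustration, and it uses a blocking sendall instead of tracking writable sockets.

```python
import select
import socket


class EchoServer:
    """Select-based echo server; every reply is prefixed with a per-connection counter."""

    def __init__(self, host="127.0.0.1", port=1234):
        self.listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        self.listener.bind((host, port))
        self.listener.listen()
        # Per-connection state (hypothetical layout): socket -> next line number.
        self.counters = {}

    def step(self, timeout=None):
        """Run one select() round: accept new clients and echo pending data."""
        watched = [self.listener, *self.counters]
        readable, _, _ = select.select(watched, [], [], timeout)
        for sock in readable:
            if sock is self.listener:
                conn, _addr = sock.accept()
                self.counters[conn] = 1
                continue
            data = sock.recv(4096)
            if not data:
                # An empty read means the client closed the connection.
                del self.counters[sock]
                sock.close()
                continue
            prefix = f"Line {self.counters[sock]}: ".encode()
            sock.sendall(prefix + data)
            self.counters[sock] += 1

    def serve_forever(self):
        while True:
            self.step()
```

Calling EchoServer().serve_forever() would serve on port 1234, like the Erlang version. The counter plays the role of State in the Erlang loop function.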

Adding live update

Now let’s consider our options for live update. When Erlang does a hot code swap, it just changes the code inside the VM, and the process PID stays the same. While it is possible to do something like this in Python, there is a much easier way: we can create another process and pass the application state and sockets to it.

There are many ways to handle passing state. First, we could store the state in an external database and keep locally only data that can be regenerated from the database (e.g. just a local cache); in that case we don’t need to pass state at all. But this increases the latency of accessing state, which might be a huge issue. Another option is to serialize the state and pass it via a file, socket, pipe, etc.

Passing state is only half of the issue, since what’s really important is to pass the sockets. On Linux there are 3 options for this:

  • Spawn the new process as a child of the original one and make it inherit the file descriptors of our sockets; this also works on Windows;

  • Pass file descriptors via a unix socket; in that case the new process isn’t required to be a child of the original one;

  • Use the pidfd_getfd() syscall, which is very new and available only since Linux kernel 5.6.

Also, I should probably mention that you can’t pass a socket via /proc/<pid>/fd/<fd> (that was the reason for adding the pidfd_getfd syscall).
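For reference, the second option can be sketched with socket.send_fds and socket.recv_fds, which exist in the standard library since Python 3.9 on Unix. The helper names and the in-process demo below are made up for illustration; in a real update the two ends of the unix socket would live in different processes.

```python
import socket


def send_descriptors(channel, fds, payload=b"fds"):
    """Send open file descriptors over a connected AF_UNIX socket (SCM_RIGHTS)."""
    # Some ordinary data has to accompany the ancillary message.
    socket.send_fds(channel, [payload], list(fds))


def recv_descriptors(channel, max_fds, bufsize=1024):
    """Receive descriptors; returns (payload, list of new descriptor numbers)."""
    payload, fds, _flags, _addr = socket.recv_fds(channel, bufsize, max_fds)
    return payload, fds


def demo():
    # This socketpair stands in for the unix socket between old and new process.
    old_side, new_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    # A second socketpair stands in for a live client connection.
    client, served = socket.socketpair()

    send_descriptors(old_side, [served.fileno()])
    _payload, (fd,) = recv_descriptors(new_side, max_fds=1)

    # The received number is a new descriptor for the same connection.
    adopted = socket.socket(fileno=fd)
    served.close()  # the "old process" lets go of its copy
    adopted.sendall(b"still connected")
    reply = client.recv(1024)

    for sock in (old_side, new_side, client, adopted):
        sock.close()
    return reply
```

Here the connection stays usable after the original descriptor is closed, because the receiver holds its own descriptor for the same underlying socket.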

So let’s proceed. Here I will describe the first one, since it’s the easiest to implement ;)

When our app receives a SIGUSR2 signal (I made it similar to nginx), it calls the following update method:

def update(self, sig, frame):
    rfd, wfd = os.pipe()
    os.set_inheritable(rfd, True)
    os.set_inheritable(0, True)
    os.set_inheritable(1, True)
    os.set_inheritable(2, True)

    for sock in (self.rsocks | self.wsocks | self.xsocks):
        sock.set_inheritable(True)

    try:
        pid = os.fork()
    except OSError:
        # fork failed (Python raises instead of returning a negative pid);
        # keep serving from the current process
        return

    if pid == 0:
        env = os.environ
        env['HOT_SWAP'] = str(rfd)
        os.execve(__file__, sys.argv, env)
    else:
        try:
            os.write(wfd, self.json_state().encode("utf8"))
        finally:
            os.close(wfd)

        sys.exit(0)

It first creates a pipe pair (reading and writing descriptors) and makes the reading descriptor inheritable, so the child process can read from this pipe. We use this pipe to pass the application state. We also make each socket descriptor inheritable for the same reason. Then we create a new process via os.fork(), which returns 0 in the child process and the child PID in the parent. The parent process then writes the application state into the pipe and exits, while the child process sets the HOT_SWAP environment variable to the descriptor that carries the app state and then calls execve on the same executable file with the same arguments. Note that passing __file__ to execve requires the script itself to be executable and to start with a shebang line.

After execve, our new child process checks the HOT_SWAP environment variable. If it contains a valid descriptor, the process reads the application state from it; otherwise it initializes an empty state.

def run(self):
    state_fd = os.getenv('HOT_SWAP', None)
    if state_fd is not None:
        try:
            state_fd = int(state_fd)
        except ValueError:
            state_fd = None

    if state_fd is not None:
        self.read_state(state_fd)
    else:
        self.init_state()

    signal.signal(signal.SIGUSR2, self.update)

    try:
        self.reactor()
    except KeyboardInterrupt:
        self.close()

You can check the full code here. I also made another asciicast with our Python example:

systemd helpers

I should also mention that systemd has some helpers for creating applications that do live updates. I mean the sd_pid_notify_with_fds function, which allows you to pass file descriptors to the service manager using the FDSTORE=1 state variable; later these descriptors can be retrieved with sd_listen_fds_with_names or sd_listen_fds. For more information, check the manual pages for sd_pid_notify_with_fds and sd_listen_fds.
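On the receiving side, the descriptors arrive through environment variables, so the lookup can be approximated in pure Python. This is a simplified sketch of the environment protocol behind sd_listen_fds_with_names, not a binding to libsystemd: the function name is made up, and unlike the real helper it does not set the close-on-exec flag on the descriptors.

```python
import os

SD_LISTEN_FDS_START = 3  # number of the first descriptor passed by systemd


def listen_fds_with_names(environ=None, pid=None):
    """Return {fd: name} for descriptors passed by the service manager."""
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid

    # The variables are addressed to one specific process; ignore them otherwise.
    if environ.get("LISTEN_PID") != str(pid):
        return {}
    try:
        count = int(environ.get("LISTEN_FDS", "0"))
    except ValueError:
        return {}

    # LISTEN_FDNAMES is a colon-separated list, one name per descriptor.
    raw_names = environ.get("LISTEN_FDNAMES", "")
    names = raw_names.split(":") if raw_names else []

    result = {}
    for i in range(count):
        # Assumption: unnamed descriptors are reported as "unknown".
        name = names[i] if i < len(names) else "unknown"
        result[SD_LISTEN_FDS_START + i] = name
    return result
```

Each returned number could then be wrapped with socket.socket(fileno=fd), and the name can be used to tell a listening socket from a stored client connection.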

Is it really like in Erlang?

A watchful reader could complain that my Python example isn’t fully equivalent to the Erlang example, since during an update it stops serving clients until the update is complete. So my Python example has a small pause, while the Erlang example has none, and this might be critical for latency-sensitive applications. But since it was just an example, I tried to make it as simple as possible. In a real-world case it would be possible to implement a smarter way of passing state and file descriptors that reduces this pause. For example, we could pass sockets and their state one by one, so they gradually migrate to the new process, which would greatly reduce the pause.

Is there any library or framework that could make this easier?

Just as hot code swap is tightly integrated into the OTP platform in Erlang, similar things could be done for Python. Unfortunately, I don’t know of any framework or library that does this. Moreover, a lot of popular Python networking libraries maintain some hidden state, so they would have to be modified for a use case like this.

Update: I also recommend checking the Cloudflare blog post on this topic, where they provide Go examples.