Graceful restart of application servers / Gunicorn workers

Vishal Singh
4 min read · Apr 7, 2021

We restart application servers many times over the life cycle of a web app to reload its state. The reasons can be code updates, configuration changes, migrations, etc.

Problem with normal restart

A normal restart stops the app server immediately by killing its processes. It does not spare workers/threads that are still handling requests; it simply stops and starts the server again without giving in-flight work a chance to complete.

We can do a normal restart by finding the child process and killing it via its PID; the parent process will spawn the child again.

ps aux | grep gunicorn |grep <program-name> | awk '{ print $2 }' | xargs kill

What is a graceful restart?

A graceful restart ensures that “everything” is cleaned up properly, without losing operations currently in execution, before the application exits. “Everything” can be database connections, in-progress jobs, requests, etc.

Code setup for benchmarking

We will use the following setup to demonstrate that a graceful restart can be achieved with the correct supervisor configuration, and to measure the downsides associated with it:

Application server: a simple Django application hosted in the cloud that simply logs the request number in a database table (a sketch of such a view is shown at the end of this setup).

Nginx: a proxy to forward requests to the application server.

Gunicorn: to run the application workers.

Supervisord: a daemon for managing application processes. From supervisorctl, a user can connect to different supervisord processes.

Client server: a tiny program that sends load to the web application (a given number of requests at a given concurrency level) and prints stats.

You can find the code setup here.
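For concreteness, the request-logging view in such a setup could look roughly like the sketch below. This is an illustrative sketch only; the model and field names (RequestLog, number) are assumptions, not the actual code from the repo.

# views.py: illustrative sketch of a view that logs the request number.
# RequestLog (with an integer "number" field) is a hypothetical model.
from django.http import JsonResponse

from .models import RequestLog


def log_request(request):
    # Persist the request number so we can later check how many requests
    # actually made it into the database across a restart.
    number = int(request.GET.get("number", 0))
    RequestLog.objects.create(number=number)
    return JsonResponse({"logged": number})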

Experiments

We did the following experiments:

Experiment 1:

  • Spawned a heavy task that takes 300 seconds to complete; every second it logs a count value in a table (a sketch of such a task follows this list).
  • Triggered the task and restarted the supervisor program before it could complete.
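The heavy task is essentially a long-running request handler. A minimal sketch of the idea, again with a hypothetical CountLog model rather than the repo’s actual code:

# Illustrative sketch of the 300-second task; CountLog is a hypothetical model
# with an integer "count" column.
import time

from django.http import JsonResponse

from .models import CountLog


def heavy_task(request):
    # Write one row per second for 300 seconds. If the worker is killed
    # mid-way, the table ends up with fewer than 300 rows.
    for count in range(1, 301):
        CountLog.objects.create(count=count)
        time.sleep(1)
    return JsonResponse({"logged": 300})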

Observation:

  • The server logged all 300 values in the DB over the 300 seconds.
  • The server returned 502 gateway errors for subsequent requests during this 300-second window.

Experiment 2:

  • Flooded the app server with 1000+ requests through an automated program (a rough sketch of such a client follows this list).
  • Restarted the supervisor program at 1:28:31 while the request flood was in progress.
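The client program boils down to something like the sketch below: fire a fixed number of requests at a fixed concurrency and count how many fail. The URL and numbers are placeholders, not the actual values used in the benchmark.

# Rough sketch of the load client; URL and numbers are placeholders.
import concurrent.futures
import urllib.request

URL = "http://example.com/log/"
TOTAL = 1000
CONCURRENCY = 20


def hit(i):
    # Returns True if the request succeeded, False if it was dropped or errored.
    try:
        with urllib.request.urlopen(f"{URL}?number={i}", timeout=10) as resp:
            return resp.status == 200
    except Exception:
        return False


with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(hit, range(TOTAL)))

print(f"served={sum(results)} dropped={results.count(False)}")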

Observation:

  • The stats show that 122 requests were dropped while the server was restarting.
  • The red curve in Graph 1 plots the failure rate of the requests, which went up during the restart and dropped back to 0 afterwards.
Stats 1: Request and response stats
Graph 1: Requests served per second
Graph 2: Response time for the requests

Findings

We restarted the program via supervisorctl:

sudo supervisorctl reread
sudo supervisorctl update
sudo supervisorctl restart <program-name>

Finding #1:

  • Using the supervisor restart command instructs the gunicorn workers to do a graceful shutdown.
  • Gunicorn treats the TERM signal as a graceful shutdown (docs), and supervisor sends TERM by default when restarting a program.
  • But make sure supervisor does not kill the worker before it finishes its graceful shutdown: set “stopwaitsecs” in your supervisor config file to a value higher than the estimated job processing time (Ref). A config sketch is shown after this list.
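For reference, a supervisor program section along these lines would cover the 300-second task from Experiment 1. The program name, paths, and numbers below are placeholders, not the exact config from the repo.

; Illustrative supervisor program section; names, paths and numbers are placeholders.
[program:myapp]
command=/path/to/venv/bin/gunicorn myapp.wsgi:application --bind 127.0.0.1:8000 --workers 3
directory=/path/to/app
autostart=true
autorestart=true
; supervisor sends TERM by default, which gunicorn treats as a graceful shutdown
stopsignal=TERM
; give workers longer than the slowest job (300 s here); otherwise supervisor escalates to SIGKILL
stopwaitsecs=310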

Finding #2:

  • This restart is not a zero-downtime solution; it takes a few seconds before new requests are served again.

Finding #3:

  • The HUP signal performs a similar graceful shutdown to TERM, and slightly fewer requests were dropped, but it comes with an interesting problem.

The official doc states:

HUP: Reload the configuration, start the new worker processes with a new configuration and gracefully shutdown older workers. If the application is not preloaded (using the preload_app option), Gunicorn will also load the new version of it.

Gunicorn’s HUP-reload will fail if you switch your codebase using symlinks.

Digging deeper, we found a few reasons:

  • Many of the Python standard library functions that Gunicorn employs, following POSIX, resolve symlinks to absolute paths.
  • Gunicorn stores its working directory at startup time, effectively resolving the symlink once at startup rather than on every reload. So simply reloading the worker config will not pick up the newest symlink target (see the illustration below).
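As an illustration, consider a typical symlink-based deploy (all paths here are made up for the example): gunicorn keeps the path it resolved at startup, so switching the symlink and sending HUP does not pick up the new release.

# Illustrative symlink-based deploy; all paths are placeholders.
# /srv/app/releases/v1   <- old code (gunicorn was started with "current" pointing here)
# /srv/app/releases/v2   <- new code

ln -sfn /srv/app/releases/v2 /srv/app/current   # switch the "current" symlink to the new release
kill -HUP $(cat /run/gunicorn.pid)              # workers restart, but still run from .../releases/v1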

Conclusion

Using supervisor to manage the processes solves our problem. The challenge with a normal restart via the kill command is that the worker is left responsible for restarting itself, which is prone to problems.

Different tools implement their signals differently; e.g., nginx uses SIGHUP for a graceful restart while Apache uses the same signal for a hard restart. We need a way to abstract out these command-to-signal mappings and make them configurable, so that the restart procedure is not tool dependent.

References:

  1. https://github.com/Supervisor/supervisor/issues/53
  2. https://www.onurguzel.com/supervisord-restarting-and-reloading/
  3. https://docs.gunicorn.org/en/latest/faq.html#how-do-i-reload-my-application-in-gunicorn
  4. https://uwsgi-docs.readthedocs.io/en/latest/articles/TheArtOfGracefulReloading.html
  5. https://docs.gunicorn.org/en/stable/signals.html
