There was a problem lately in a Django (Python) system I work with: some of the memcache requests took a really long time to finish, several seconds in some cases. This resulted in long page loads and even caused some MySQL queries to pile up, producing an undesirably high load on the database. Something had to be done!
We need timeouts!
The main idea was that some timeout is needed, so that the system won't wait forever for memcache. I started investigating Django's caching layer and dug deeper and deeper until I hit a wall named python-memcached, the library used to access memcache. Along the way I didn't see any indication of timeout settings, but there were some options passed down to the library, so I still had some hope. Unfortunately it turned out that python-memcached does not support per-request timeouts; it was a dead end.
Still, we need timeouts, I started looking for some other way…
Act I: Signals
Searching the internet was really productive, and I managed to find a timeout script which used signals. The idea behind this solution is that you schedule an asynchronous alarm in the future with signal.alarm(), which, when it fires, calls a handler function. A little added hurdle is that the original handler has to be stored and set back when the timeout logic is done.
There was a problem with this solution: signal.alarm() only accepts integer values, which translate to seconds, so the smallest possible timeout was 1 second. This is already too much, since the memcache requests in question should finish within a few ms. No need to worry, there is another signal function for the job, signal.setitimer(), which works with float values, so a fraction of a second can be set. The final code was like this:
```python
from functools import wraps
import signal


class TimeoutException(Exception):
    pass


def timeout(timeout):
    def wrap_function(func):
        @wraps(func)
        def __wrapper(*args, **kwargs):
            def handler(signum, frame):
                raise TimeoutException()
            old = signal.signal(signal.SIGALRM, handler)
            # the timeout is given in milliseconds, setitimer expects seconds
            signal.setitimer(signal.ITIMER_REAL, float(timeout) / 1000)
            try:
                result = func(*args, **kwargs)
            finally:
                # cancel the pending alarm and restore the original handler
                signal.setitimer(signal.ITIMER_REAL, 0)
                signal.signal(signal.SIGALRM, old)
            return result
        return __wrapper
    return wrap_function
```
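In use, the decorator looks like this. A minimal sketch: `slow_call` and the 50 ms budget are made-up stand-ins for a hanging memcache request, the decorator is repeated (with the alarm cancelled afterwards) so the snippet runs on its own, and SIGALRM only exists on Unix:

```python
import signal
import time
from functools import wraps


class TimeoutException(Exception):
    pass


def timeout(ms):  # the decorator above, repeated so this snippet is standalone
    def wrap_function(func):
        @wraps(func)
        def __wrapper(*args, **kwargs):
            def handler(signum, frame):
                raise TimeoutException()
            old = signal.signal(signal.SIGALRM, handler)
            signal.setitimer(signal.ITIMER_REAL, float(ms) / 1000)
            try:
                result = func(*args, **kwargs)
            finally:
                signal.setitimer(signal.ITIMER_REAL, 0)  # cancel pending alarm
                signal.signal(signal.SIGALRM, old)
            return result
        return __wrapper
    return wrap_function


@timeout(50)  # hypothetical stand-in for a memcache call; 50 ms budget
def slow_call():
    time.sleep(1)  # pretend the server hangs
    return "value"


@timeout(50)
def fast_call():
    return "value"


print(fast_call())  # "value"
try:
    slow_call()
except TimeoutException:
    print("timed out")  # the alarm fired mid-sleep
```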
This is a working solution in simple programs, but in a framework such as Django, signals are delicate things and the system relies on them. When I ran this version against the test suite, it simply quit on the first SIGALRM. I had to find another way.
Act II: Processes
I started looking for ways to implement timeouts with parallel processing and found Python's multiprocessing module. The idea here is to run the critical logic in a separate process, store the result in a queue, and terminate the process if it is still running when the timeout comes:
```python
from functools import wraps
from multiprocessing import Process, Queue


class TimeoutException(Exception):
    pass


def timeout(timeout):
    def wrap_function(func):
        @wraps(func)
        def __wrapper(*args, **kwargs):
            def queue_wrapper(args, kwargs):
                q.put(func(*args, **kwargs))
            q = Queue()
            p = Process(target=queue_wrapper, args=(args, kwargs))
            p.start()
            p.join(float(timeout) / 1000)  # wait at most the timeout
            if p.is_alive():
                # still running: kill the worker and give up
                p.terminate()
                p.join()
                raise TimeoutException()
            p.terminate()
            return q.get()
        return __wrapper
    return wrap_function
```
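Usage is the same as before. Another sketch with invented call names: this only works with the default fork start method (on Linux), since the closure passed as the process target cannot be pickled under spawn:

```python
import time
from functools import wraps
from multiprocessing import Process, Queue


class TimeoutException(Exception):
    pass


def timeout(ms):  # the decorator above, repeated so this snippet is standalone
    def wrap_function(func):
        @wraps(func)
        def __wrapper(*args, **kwargs):
            def queue_wrapper(args, kwargs):
                q.put(func(*args, **kwargs))
            q = Queue()
            p = Process(target=queue_wrapper, args=(args, kwargs))
            p.start()
            p.join(float(ms) / 1000)  # wait at most the timeout
            if p.is_alive():
                p.terminate()  # still running: kill it and give up
                p.join()
                raise TimeoutException()
            return q.get()
        return __wrapper
    return wrap_function


@timeout(100)  # hypothetical stand-in for a memcache call; 100 ms budget
def slow_call():
    time.sleep(1)
    return "value"


@timeout(500)
def fast_call():
    return "value"


print(fast_call())  # "value"
try:
    slow_call()
except TimeoutException:
    print("timed out")
```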
This solution was a step forward, but it also had its own flaws. No more messing with signals (except for SIGTERM, but that is delivered to a different process, so no problem), and the test suite was running steadily, but it was really slow. One of my colleagues pointed out that this solution wasn't suited for large frameworks like Django either, since process creation is based on the fork(2) system call, which creates a copy of the current process.
Copying the whole environment for a single call is a really bad idea. It uses a lot of memory and takes a considerable amount of time, which in the end renders the whole timeout effort useless. In the test suite I set a 100ms timeout, and a lot of tests timed out; the system clearly waited the full 100ms almost every time (hence the slowness). Some tests even failed, since they depend on the cache, which timed out frequently. Again, I had to find another way.
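The cost difference is easy to see for yourself. This quick, machine-dependent comparison (entirely my own illustration, not from the original investigation) starts and joins a no-op worker twenty times, once with processes and once with threads; every process start pays for a fork(2):

```python
import time
from multiprocessing import Process
from threading import Thread


def noop():
    pass


def measure(factory, n=20):
    # start and join n workers, return the total wall-clock time
    start = time.time()
    for _ in range(n):
        worker = factory(target=noop)
        worker.start()
        worker.join()
    return time.time() - start


process_time = measure(Process)
thread_time = measure(Thread)
print("processes: %.3fs, threads: %.3fs" % (process_time, thread_time))
```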
Act III: Threads
What runs in parallel but is cheaper than a whole process? Threads. I started by looking at Python's threading module, but my first experiments were clunky: I could start a thread with a function, but couldn't get the result out. I was starting to lose hope when I found ThreadPool.
It came up in a StackOverflow answer, and it is an undocumented hidden gem of Python concurrent programming. It lives in the multiprocessing package and uses the same interface as the (process-based) Pool. Pools are really convenient: you can apply a function on them asynchronously and get an async result object back. Calling get() on this object returns the result of the applied function, and get() accepts a timeout. If the function didn't finish before the timeout, a TimeoutError exception is raised. Now imagine this with threads!
No more handlers, signals or manual queue management, and the timeout exception is even raised for me. No wonder my first implementation was only a few lines compared to the previous solutions, and it was working! But… as usual, it had some problems.
The first problem was that it relies on an attribute which is not accessible in Python <2.7.2, but there is a clever fix for that. The other was that once it was running in the test suite, it exhausted all the threads really fast… because I was an idiot, creating a new ThreadPool (with one thread) for every timeout. In a small environment this caused no problem at all, but in the test suite it was showering errors (and segmentation faults). The fix was a lazy global ThreadPool, which is created only when needed and only once, plus a thread.error fallback for extreme cases. After fixing everything it ran like a charm; here is the final decorator:
```python
from functools import wraps
from multiprocessing import TimeoutError
from multiprocessing.pool import ThreadPool
import thread
import threading
import weakref

thread_pool = None


def get_thread_pool():
    # lazy global pool: created only when first needed, and only once
    global thread_pool
    if thread_pool is None:
        # fix for python <2.7.2
        if not hasattr(threading.current_thread(), "_children"):
            threading.current_thread()._children = weakref.WeakKeyDictionary()
        thread_pool = ThreadPool(processes=1)
    return thread_pool


def timeout(timeout):
    def wrap_function(func):
        @wraps(func)
        def __wrapper(*args, **kwargs):
            try:
                async_result = get_thread_pool().apply_async(func, args=args, kwds=kwargs)
                return async_result.get(float(timeout) / 1000)
            except thread.error:
                # could not start a worker thread; fall back to a direct call
                return func(*args, **kwargs)
        return __wrapper
    return wrap_function
```
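For the record, on Python 3 the same idea gets simpler: the _children workaround is no longer needed and the thread module became _thread. A sketch of the equivalent, with made-up example functions (note the caveat in the last comment, which applies to the original version too):

```python
import time
from functools import wraps
from multiprocessing import TimeoutError
from multiprocessing.pool import ThreadPool

_pool = None  # lazy global pool, created once on first use


def get_thread_pool():
    global _pool
    if _pool is None:
        _pool = ThreadPool(processes=1)
    return _pool


def timeout(ms):
    def wrap_function(func):
        @wraps(func)
        def __wrapper(*args, **kwargs):
            async_result = get_thread_pool().apply_async(func, args=args, kwds=kwargs)
            return async_result.get(float(ms) / 1000)
        return __wrapper
    return wrap_function


@timeout(500)
def fast_call():
    return "cached value"


@timeout(50)
def slow_call():
    time.sleep(0.3)  # hypothetical hanging memcache request
    return "never returned"


print(fast_call())  # "cached value"
try:
    slow_call()
except TimeoutError:
    print("timed out")
# caveat: after a timeout the single pool thread is still busy until the
# abandoned call finishes, so the next call may queue behind it
```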
This was a long journey with many obstacles along the way, but I learned a lot about concurrent programming, processes and threads. I won't say that I understand it fully, though. The three versions I created show that you can do things in many ways with Python, but in the end only one will fit your exact needs, and I wouldn't be surprised if there was an even better solution out there…
You can find my experimenting in the following gist: https://gist.github.com/aorcsik/bcc17a299434ee2a2a1a