Numpy broke my heart

Categories: python, joblib, multiprocessing, numpy

Author: Nicolas Chopin

Published: April 25, 2024

I swear, the title is kind of funny in French (try to figure out why). Anyway, in this post I want to dispel a misconception I had until recently about Python, numpy and multi-processing, one which led me to say something silly in our SMC book.

[Image: no comment]

Python and multi-processing

Most modern computers have several CPU cores; I guess even potato computers have at least two these days? On the other hand, a program written in Python will be executed on a single core, because of the GIL (global interpreter lock). This means that all the other cores stay idle while you run your program. Which is frustrating when said program takes forever to complete.

There are different ways to make all your CPU cores work for you, but I will discuss only the two I am (a bit) familiar with:

  1. Use joblib or a similar library. (But seriously, just use joblib, it’s great.) This requires a bit of work, as you have to state explicitly which parts of your program may be turned into independent tasks to be performed in parallel. The typical use case for me is to run the same SMC algorithm several times (perhaps with different parameters, e.g. a different number of particles); see for instance this, and the sketch right after this list.

  2. Do nothing, and pray that your program relies on those Numpy operations that are already parallelised for you (multithreaded). Numpy relies on low-level (C/Fortran) linear algebra libraries such as BLAS and LAPACK, and these libraries implement certain operations (e.g. matrix multiplication) on multiple cores. In this case, your Python script still runs on a single core, but, when it reaches a multithreaded numpy operation, that operation spawns (temporarily, for this operation only) several threads that are executed on different cores.
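To make point 1 concrete, here is a minimal sketch of the joblib pattern; run_smc and the parameter values are placeholders, not the actual experiment:

```python
from joblib import Parallel, delayed

def run_smc(n_particles, seed):
    """Placeholder for one independent run of an SMC algorithm."""
    # ... run the sampler, return whatever output you care about ...
    return n_particles, seed

# Each call to run_smc becomes an independent task, dispatched to 4 workers.
results = Parallel(n_jobs=4)(
    delayed(run_smc)(N, seed) for N in (100, 1000) for seed in range(5)
)
```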

My bad

Ok, now for my misconception (a.k.a. what an idiot I am). When I run the following script on a standard PC (it implements the numerical experiment of Chapter 17, on SMC samplers, in our book), all the cores are kept busy during the execution. This script does not rely on any form of explicit parallelism. Several SMC samplers are run, but sequentially (I don’t use multiSMC in this script). So clearly it’s numpy that is doing its thing (point 2 above). In fact, profiling shows that most of the CPU time is spent in the one line that computes the log-likelihood of the logistic regression model, which involves a matrix multiplication. So this makes sense.
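For the record, this kind of diagnosis only requires the standard library; a sketch, where main() is a stand-in for the script’s actual entry point:

```python
import cProfile
import pstats

# Profile one full run of the script's entry point.
cProfile.run("main()", "profile.out")

# Print the 10 entries with the largest cumulative time; in my case
# the log-likelihood line dominates.
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)
```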

In the book (page 352 if you want to check), I naively said: if you have k cores, you get a ×k speed-up for free in this particular experiment. I thought that was the case because all my CPU cores were 100% busy the whole time.

However, I did some more testing recently and compared the running time of this script when numpy does multithreading and when it does not. (See here for how to disable multithreading in numpy; a sketch follows the list below.)

  1. On a standard PC, the speed-up is more like… one per cent?

  2. On a certain cloud-based architecture that I’m currently playing with (and which relies on Kubernetes containers), multithreading can actually slow down the script by a factor of 10 or more.
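Regarding the parenthetical above, here is a sketch of the two standard ways to disable numpy’s multithreading. The environment variables are read when the BLAS library is loaded, so they must be set before numpy is imported; threadpoolctl (a separate package) can do it after the fact:

```python
import os

# Option 1: cap the BLAS/OpenMP thread pools via environment variables.
# This must happen before `import numpy`.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np
from threadpoolctl import threadpool_limits

# Option 2: threadpoolctl limits the pools at runtime, even temporarily.
with threadpool_limits(limits=1, user_api="blas"):
    rng = np.random.default_rng(0)
    A = rng.random((1000, 1000))
    B = A @ A  # runs on a single thread inside this block
```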

What’s going on?

I am not sure; I’m a bit out of my depth here. I guess what happens is that, for this particular script, the speed-up brought by multithreading is cancelled out by the overhead of spawning the threads at the start of the numpy operation. (Remember that this must be done every time a line with a multithreaded numpy operation is executed.) In fact, the multithreaded operation here seems to be a matrix-vector multiplication, where the matrix is not very large. (It’s of size \((N, d)\), where \(N\) is the number of particles. I tried to increase N several times over, but it did not change the results.)
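Here is the kind of comparison I mean, as a sketch (the shapes are invented, in the spirit of the \((N, d)\) multiplication above):

```python
import timeit
import numpy as np
from threadpoolctl import threadpool_limits

N, d = 100_000, 50  # tall and thin, like the matrix in the log-likelihood
rng = np.random.default_rng(0)
A, v = rng.random((N, d)), rng.random(d)

# Time the same product with the BLAS pool capped at 1 thread, then 4.
for n_threads in (1, 4):
    with threadpool_limits(limits=n_threads, user_api="blas"):
        t = timeit.timeit(lambda: A @ v, number=1000)
    print(f"{n_threads} BLAS thread(s): {t:.3f}s for 1000 products")
```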

And things may get worse in containers, where either numpy may make wrong assumptions about the available resources, or you simply share resources with many other users. (Disclaimer: I don’t know what I’m talking about.)

Also, of course, this kind of result may depend on your hardware, on the versions of Python and of the related libraries you are using (in particular on whether your numpy is linked against OpenBLAS, or against MKL, which is specific to Intel CPUs; to check, look at the output of numpy.show_config() on your machine), and so on. The picture below summarises the situation.

[Image: Alice decided to better understand multiprocessing in Python]
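By the way, a quick way to inspect both the BLAS backend and the active thread pools, sketched with threadpoolctl:

```python
import numpy as np
from threadpoolctl import threadpool_info

# Which BLAS/LAPACK implementation numpy was built against.
np.show_config()

# Which thread pools are loaded at runtime, and their current thread counts.
for pool in threadpool_info():
    print(pool["user_api"], pool["internal_api"], pool["num_threads"])
```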

Enter joblib

The discussion above assumes you run a single program, and that Numpy may or may not get access to all the cores. What if you implement multi-processing yourself (using joblib, multiprocessing or something else), and each task performs numpy operations? You could then run into an over-subscription problem: you end up with far more threads than cores, and the computer wastes a lot of time juggling between them.

Fortunately, joblib is smart enough to tell numpy to calm down and spawn fewer threads. This point is discussed here in the documentation. Well worth a read.
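If you want to control this yourself, recent joblib versions let you cap the number of threads each worker may use; a sketch, assuming the loky backend:

```python
import numpy as np
from joblib import Parallel, delayed, parallel_backend

def task(seed):
    rng = np.random.default_rng(seed)
    A = rng.random((2000, 2000))
    return (A @ A).sum()

# 4 worker processes, each restricted to a single BLAS thread:
# at most 4 threads in flight, hence no over-subscription.
with parallel_backend("loky", n_jobs=4, inner_max_num_threads=1):
    results = Parallel()(delayed(task)(s) for s in range(8))
```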

I managed to speed up my script significantly by using joblib, but I still cannot obtain a ×24 factor on my nifty 24-core PC. I am still crying (I mean, investigating).

Take-home messages

  • The fact that all 20 of your CPU cores are busy does not mean that your script runs 20 times faster.

  • If you actually want to achieve a substantial speed-up on multi-core hardware, you may need to try different things and check the actual results (i.e. measure the total running time).

  • Read this and this if you want to learn more about multiprocessing and numpy; I found these pages clear and authoritative on the topic.

  • Don’t believe everything you read in books? :-)