
I noticed that my Python application runs much slower on python:2-alpine3.6 than it does without Docker on Ubuntu. I came up with two small benchmark commands, and there's a huge difference visible between the two operating systems, both when I run them on an Ubuntu server and when I use Docker for Mac.

$ BENCHMARK="import timeit; print(timeit.timeit('import json; json.dumps(list(range(10000)))', number=5000))"
$ docker run python:2-alpine3.6 python -c "$BENCHMARK"
7.6094589233
$ docker run python:2-slim python -c "$BENCHMARK"
4.3410820961
$ docker run python:3-alpine3.6 python -c "$BENCHMARK"
7.0276606959
$ docker run python:3-slim python -c "$BENCHMARK"
5.6621271420

I also tried the following 'benchmark', which doesn't use Python:

$ docker run -ti ubuntu bash
root@6b633e9197cc:/# time $(i=0; while (( i < 9999999 )); do (( i++ )); done)

real    0m39.053s
user    0m39.050s
sys     0m0.000s
$ docker run -ti alpine sh
/ # apk add --no-cache bash > /dev/null
/ # bash
bash-4.3# time $(i=0; while (( i < 9999999 )); do (( i++ )); done)

real    1m4.277s
user    1m4.290s
sys     0m0.000s

What could be causing this difference?

Underyx

2 Answers


I've run the same benchmark as you did, using just Python 3:

$ docker run python:3-alpine3.6 python --version
Python 3.6.2
$ docker run python:3-slim python --version
Python 3.6.2

resulting in a difference of more than 2 seconds:

$ docker run python:3-slim python -c "$BENCHMARK"
3.6475560404360294
$ docker run python:3-alpine3.6 python -c "$BENCHMARK"
5.834922112524509

Alpine uses a different implementation of libc (the base system library), one that comes from the musl project (mirror URL). There are many differences between the two libraries, and as a result each might perform better in certain use cases.
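
A quick way to confirm which libc an image actually ships is to poke at it directly; a sketch (the musl loader path below assumes an x86_64 image):

$ docker run python:3-slim ldd --version        # glibc prints its version banner
$ docker run python:3-alpine3.6 ls /lib/ld-musl-x86_64.so.1   # musl's dynamic loader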

Here's an strace diff between the two commands above. The outputs start to differ at line 269. Apart from different memory addresses, the traces are very similar, and most of the time is obviously spent waiting for the python command to finish.

After installing strace into both containers, we can obtain a more interesting trace (I've reduced the number of iterations in the benchmark to 10).
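
For reference, such a trace can be captured roughly like this; strace needs ptrace permissions, so on a default Docker setup the container has to be started with the extra capability (a sketch):

$ docker run --cap-add SYS_PTRACE -ti python:3-alpine3.6 sh
/ # apk add --no-cache strace
/ # strace -f -o /tmp/musl.trace python -c "import json; json.dumps(list(range(10000)))"

The same works in python:3-slim with apt-get update && apt-get install -y strace instead of apk add.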

For example, glibc is loading libraries in the following manner (line 182):

openat(AT_FDCWD, "/usr/local/lib/python3.6", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
getdents(3, /* 205 entries */, 32768)   = 6824
getdents(3, /* 0 entries */, 32768)     = 0

The same code in musl:

open("/usr/local/lib/python3.6", O_RDONLY|O_DIRECTORY|O_CLOEXEC) = 3
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
getdents64(3, /* 62 entries */, 2048)   = 2040
getdents64(3, /* 61 entries */, 2048)   = 2024
getdents64(3, /* 60 entries */, 2048)   = 2032
getdents64(3, /* 22 entries */, 2048)   = 728
getdents64(3, /* 0 entries */, 2048)    = 0

I'm not saying this is the key difference, but reducing the number of I/O operations in core libraries might contribute to better performance. From the diff you can see that executing the very same Python code can lead to slightly different system calls. The most important gains would probably come from optimizing loop performance, though I'm not qualified enough to judge whether the issue is caused by memory allocation or by some other instruction.
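
Instead of eyeballing the diff, strace's -c flag can summarize the number of calls and time spent per syscall; a sketch, run inside each container as set up above:

/ # strace -c -f python -c "import json; json.dumps(list(range(10000)))"

For reference, here are the final write calls from each 10-iteration trace: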

  • glibc with 10 iterations:

    write(1, "0.032388824969530106\n", 210.032388824969530106)
    
  • musl with 10 iterations:

    write(1, "0.035214247182011604\n", 210.035214247182011604)
    

musl is slower by roughly 0.0028 seconds per run. As the difference grows with the number of iterations, I'd assume it lies in the memory allocation of JSON objects.
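
One way to test that assumption is to scale the iteration count and watch whether the gap grows with it; a sketch (here the shell expands $n into the timeit source):

$ for n in 100 1000 10000; do
>     docker run python:3-alpine3.6 python -c "import timeit; print($n, timeit.timeit('import json; json.dumps(list(range(10000)))', number=$n))"
> done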

If we reduce the benchmark to just importing json, we notice the difference is not that big:

$ BENCHMARK="import timeit; print(timeit.timeit('import json;', number=5000))"
$ docker run python:3-slim python -c "$BENCHMARK"
0.03683806210756302
$ docker run python:3-alpine3.6 python -c "$BENCHMARK"
0.038280246779322624

Loading Python libraries looks comparable. Generating the list produces a bigger difference:

$ BENCHMARK="import timeit; print(timeit.timeit('list(range(10000))', number=5000))"
$ docker run python:3-slim python -c "$BENCHMARK"
0.5666235145181417
$ docker run python:3-alpine3.6 python -c "$BENCHMARK"
0.6885563563555479

Obviously the most expensive operation is json.dumps(), which might point to differences in memory allocation between those libraries.
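
To isolate json.dumps() itself, the list can be built once in timeit's setup argument, so that only the serialization is timed; a sketch (I haven't re-run these numbers):

$ BENCHMARK="import timeit; print(timeit.timeit('json.dumps(x)', setup='import json; x = list(range(10000))', number=5000))"
$ docker run python:3-slim python -c "$BENCHMARK"
$ docker run python:3-alpine3.6 python -c "$BENCHMARK"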

Looking again at that benchmark, musl really is slightly slower in memory allocation:

                       |  musl  | glibc
-----------------------+--------+--------
Tiny allocation & free |  0.005 | 0.002
-----------------------+--------+--------
Big allocation & free  |  0.027 | 0.016

I'm not sure what exactly is meant by "big allocation", but musl is almost 2× slower there, which might become significant when such operations are repeated thousands or millions of times.
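
That table comes from a separate allocator micro-benchmark, but a rough probe is possible from Python as well: CPython routes allocations above its small-object threshold (512 bytes) straight to libc's malloc, so repeatedly creating and dropping a large buffer exercises exactly that path. A hypothetical sketch:

$ BENCHMARK="import timeit; print(timeit.timeit('bytearray(1 << 20)', number=100000))"
$ docker run python:3-slim python -c "$BENCHMARK"
$ docker run python:3-alpine3.6 python -c "$BENCHMARK"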

Tombart

Here is an interesting discussion on the Alpine mailing list, where you can read the following:

  • Is this situation well-known?

Yes. It is known that some workloads, mostly involving malloc and heavy amounts of C string operations, benchmark poorly versus glibc.

This is largely because the security hardening features of musl and Alpine are not zero cost, but also because musl does not contain micro-architecture specific optimizations, meaning that on glibc you might get strlen/strcpy type functions that are hand-tuned for the exact CPU you are using.

In practice though, performance is adequate for most workloads.

  • Is the musl memory allocation a good lead to explain performance differences?

It is known that some memory allocation patterns lead to bad performance with the new hardened malloc. However, the security benefits of the hardened malloc to some extent justify the performance costs, in my opinion. We are still working to optimize the hardened malloc.

One workaround might be to use jemalloc instead, which is available as a package. I am investigating a way to make it possible to always use jemalloc instead of the hardened malloc for performance-critical workloads, but that will require some discussion with the musl author which I haven't gotten to yet.

  • Can memory allocation explain the whole thing?

About 70%, I would say.
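
The jemalloc workaround mentioned above can be tried with LD_PRELOAD, which musl supports; a sketch (the package name and library path are assumptions and may differ between Alpine releases):

$ docker run -ti python:3-alpine3.6 sh
/ # apk add --no-cache jemalloc
/ # LD_PRELOAD=/usr/lib/libjemalloc.so.2 python -c "import json; json.dumps(list(range(10000)))"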

azmeuk