Debugging black screen in Debian Installer

25 May 2018

What was noticed?

A few weeks ago, other developers reported on the #debian-boot IRC channel that they were seeing regressions while testing daily builds of the graphical version of the Debian Installer. This could easily be reproduced by anyone, fetching the latest netboot/gtk/mini.iso file for the amd64 architecture, which can be found in the directory of daily builds for amd64.

Timing-wise, it might appear that coincided with the move of all our git repositories from the good, old alioth.debian.org hosting service to the brand new salsa.debian.org, the Gitlab instance replacing the previous Fusionforge one. But that transition had nothing to do with our issue, which is rather related to our move from the 4.15 kernel series to the 4.16 one.

First investigation

Looking at that black screen, it looked like the X server was starting just fine, but nothing was being drawn, and input devices (keyboard, mouse) appeared inactive. Thankfully, the QEMU/KVM monitor makes it possible to trigger a switch to a different TTY, to access log files and run a few commands. It was possible to extract kernel logs and X logs to compare what was happening with different kernel versions. The big surprise was: the logs were identical, except for the uname output and the timestamps embedded in the logs. Switching back to the TTY where X was expected to show up, here it was, properly started and displaying the usual language selection screen! Moving from X logs to system logs (including kernel logs):

May 12 02:25:02 kernel: [    3.302164] random: debconf: uninitialized urandom read (8 bytes read)
May 12 02:25:02 kernel: [    3.302291] random: debconf: uninitialized urandom read (8 bytes read)
May 12 02:25:02 kernel: [    3.302351] random: debconf: uninitialized urandom read (8 bytes read)
May 12 02:25:02 kernel: [    3.302450] random: debconf: uninitialized urandom read (8 bytes read)
May 12 02:25:02 kernel: [    3.302530] random: debconf: uninitialized urandom read (8 bytes read)
May 12 02:25:02 kernel: [    3.302682] random: debconf: uninitialized urandom read (8 bytes read)
May 12 02:25:02 kernel: [    3.302804] random: debconf: uninitialized urandom read (8 bytes read)
May 12 02:25:02 kernel: [    3.302963] random: debconf: uninitialized urandom read (8 bytes read)
May 12 02:25:02 kernel: [    3.303009] random: debconf: uninitialized urandom read (8 bytes read)
May 12 02:25:02 kernel: [    3.303078] random: debconf: uninitialized urandom read (8 bytes read)
May 12 02:25:19 init: starting pid 170, tty '/dev/tty2': '-/bin/sh'
May 12 02:25:21 debconf: Setting debconf/language to en
May 12 02:25:21 main-menu[229]: DEBUG: resolver (libgcc1): package doesn't exist (ignored)
May 12 02:25:21 main-menu[229]: INFO: Falling back to the package description for brltty-udeb
May 12 02:25:21 main-menu[229]: INFO: Menu item 'localechooser' selected
May 12 02:25:22 debconf: Setting debconf/language to en
May 12 02:25:22 gtk-set-font: Switching to font 'DejaVu Sans' for 'en'
May 12 02:25:24 kernel: [   25.303708] random: crng init done

So it appears there were randomness/entropy-related issues very soon after boot-up; then, roughly 20 seconds later, debconf (which drives the installation process, asking questions and waiting for answers) was properly started; closely followed by the kernel reporting some “random” initialization had finished. This rang a bell so I’ve reported my findings as #898468, putting the Debian kernel team in copy.

Ben Hutchings swiftly replied with some more information about the possible culprit. Traditionally, X clients and X servers share an authentication cookie (see the XAUTHORITY environment variable, which can point to e.g. ~/.Xauthority, or the xauth command line tool), called MIT-MAGIC-COOKIE-1, which requires some bits of randomness. It had been reported that some desktop session programs rely on libICE to handle that cookie, which uses a function from the libbsd library, arc4random_buf(), which in turn uses the getrandom() system call exposed by the kernel.

That getrandom() syscall was recently modified as part of a security fix, and this made callers block more easily when not enough entropy has been gathered. A parameter/an option can be specified to make this call non-blocking, but it would then return an error immediately instead of blocking. More information about this can be found in the Fixing Linux getrandom() in stable thread on the debian-release@ mailing list, but basically the idea was to get libbsd fixed in unstable/testing rather than simply reverting the security fix (the latter was preferred as a follow-up to the security updates for stable which were triggering severe regressions).

The installer team hadn’t checked that libICE was actually used by the installer, but waiting for the libbsd fix to rebuild and recheck was deemed like a sensible course of action. Unfortunately, right after a first bug fix upload was accepted into unstable, no changes could be noticed in a freshly rebuilt installer…

Deeper investigations

A quick reality check revealed that the absence of changes was to be expected, as the graphical installer doesn’t use the libICE library, which doesn’t even ship any udebs…

Back to square one/screen black: debconf was being reported as the process affected by the blocking behaviour of the getrandom() call so the first step was to chase down randomness-related calls in its source code. Unfortunately nothing turned up…

Next try was diving into the gtk+2.0 source code, since that’s the graphical toolkit used by the graphical installer. That looked promising at first since there were a few calls to g_random_int() to initialize the internal stamp field for some data structures. That function is defined in the glib1.0 package (see glib/grand.c), and trying to get rid of the calls from gtk+2.0 made no differences at all…

At this point, it wasn’t clear which part was responsible for the calls to getrandom(), so Cyril went for a brute force approach. A new custom build of the graphical installer was prepared, including the strace tool (which traces syscalls); and modifying the start-up scripts to wrap debconf’s execution with strace. The rootskel package ships this line in /lib/debian-installer/menu:

exec debconf -o d-i $MENU

which was modified by a sed call in the main Makefile of the debian-installer package (build/Makefile):

sed 's,exec debconf,exec strace -v -f -s 400 -o /tmp/debconf.strace debconf,' -i "$(TREE)/lib/debian-installer/menu"

This resulted in strace writing a /tmp/debconf.strace file which could then be used to debug this further. Since the Debian Installer comes with little debugging tools, this file can be extracted using a simple network pipe, with nc/netcat.

A new hope

Searching for getrandom() calls, they were apparently appearing right after opening the /etc/fonts/fonts.conf file, but also while opening other files under the /etc/fonts/fonts.d/ directory, which contains configuration snippets for fontconfig. Searching its source code, it appeared two files could be relevant:

src/fccache.c which calls uuid_generate_random();
src/fccompat.c defines an FcRandom() function which leverages random_r() if detected through the configure script (HAVE_RANDOM_R).

Adding a few fprintf(stderr, "...\n"); statements around all suspicious calls made it clear: the uuid_generate_random() calls (implemented in libuuid1/libuuid1-udeb, packages built from the util-linux source) are responsible for the delays. Let’s look at the implementation in lib/randutils.c, the UUID generation calling get_random_bytes() in the same file:

/*
 * Generate a stream of random nbytes into buf.
 * Use /dev/urandom if possible, and if not,
 * use glibc pseudo-random functions.
 */
#define UL_RAND_READ_ATTEMPTS   8
#define UL_RAND_READ_DELAY      125000  /* microseconds */

void random_get_bytes(void *buf, size_t nbytes)
{
        unsigned char *cp = (unsigned char *)buf;
        size_t i, n = nbytes;
        int lose_counter = 0;

#ifdef HAVE_GETRANDOM
        while (n > 0) {
                int x;

                errno = 0;
                x = getrandom(cp, n, GRND_NONBLOCK);
                if (x > 0) {                    /* success */
                       n -= x;
                       cp += x;
                       lose_counter = 0;

                } else if (errno == ENOSYS) {   /* kernel without getrandom() */
                        break;

                } else if (errno == EAGAIN && lose_counter < UL_RAND_READ_ATTEMPTS) {
                        xusleep(UL_RAND_READ_DELAY);    /* no etropy, wait and try again */
                        lose_counter++;
                } else
                        break;
        }

        if (errno == ENOSYS)
#endif
        /*
         * We've been built against headers that support getrandom, but the
         * running kernel does not.  Fallback to reading from /dev/{u,}random
         * as before
         */
        {
                int fd = random_get_fd();
                /* … */

So if the HAVE_GETRANDOM macro was defined at compile-time, a getrandom() call is attempted, in non-blocking mode. If the ENOSYS error is triggered, the break; gets us out of the loop, and the code carries one with a fallback to reading from /dev/urandom or /dev/random (one gets the relevant file descriptor with random_get_fd() as seen on the last line.

But if the error is EAGAIN instead, that means not enough entropy was available. In this case, there’s a little delay getting added through xusleep(), and further attempts are made until a given limit (UL_RAND_READ_ATTEMPTS) is reached. At this point, the final break; triggers an exit of the loop and the fallback described above is reached.

Where to go from here?

To make extra sure diagnostics were correct, a little patch was developed as a proof of concept, disabling the HAVE_GETRANDOM macro, so that the while loop quoted above isn’t even tried, and so that one jumps directly to the fallback reading from /dev/*random. It was a bit tricky to get this part right since there are several places where it can be enabled: a configure.ac check, another check on the __NR_getrandom macro, and a final check on the SYS_getrandom macro.

--- a/configure.ac
+++ b/configure.ac
@@ -462,7 +462,6 @@ AC_CHECK_FUNCS([ \\
 	getdtablesize \\
 	getexecname \\
 	getmntinfo \\
-	getrandom \\
 	getrlimit \\
 	getsgnam \\
 	inotify_init \\
--- a/lib/randutils.c
+++ b/lib/randutils.c
@@ -27,14 +27,14 @@
 
 #ifdef HAVE_GETRANDOM
 # include <sys/random.h>
-#elif defined (__linux__)
+#elif 0
 # if !defined(SYS_getrandom) && defined(__NR_getrandom)
    /* usable kernel-headers, but old glibc-headers */
 #  define SYS_getrandom __NR_getrandom
 # endif
 #endif
 
-#if !defined(HAVE_GETRANDOM) && defined(SYS_getrandom)
+#if 0
 /* libc without function, but we have syscal */
 #define GRND_NONBLOCK 0x01
 #define GRND_RANDOM 0x02

This patch ensures a direct read from /dev/*random, which means delays go away, and that Debian Installer starts properly. Hurray?

Unfortunately, it doesn’t seem reasonable to degrade the quality of randomness used by a function called uuid_generate_random(), since it might be used by something a little more critical than setting UUIDs for fonts in fontconfig; a quick search on sources.debian.org returns 58 packages with that name in their source code at the very least. Increasing this function’s complexity to add an I don’t care about entropy quality option doesn’t seem too good either. So maybe it might make sense to see if fontconfig might be tweaked to use something that wouldn’t rely on getrandom()…

Time to report these findings to the Debian Installer team as a whole, adding fontconfig and util-linux developers to the loop. Let’s see what consensus we can reach.

Update (June 2018)

Ben Hutchings saved the day one more time, checking my status report and quickly pointing out that the same issue was encountered with plymouth as well. That component also runs very early in the boot process, to provide with a graphical “splash” screen.

Borrowing the fix from plymouth was very easy, as it was only about creating the fontconfig cache at build-time thanks to an fc-cache call; and that finally made this issue vanish!

There was still a little issue left to fix, as the X server had been updated in the meanwhile: a transition to a major version of the X stack had started in unstable in the meanwhile, and there was a regression with the fbdev driver. The server would fail to start with:

Xorg: symbol lookup error: /usr/lib/xorg/modules/drivers/fbdev_drv.so: undefined symbol: shadowUpdatePackedWeak

The Debian Installer side was tracked in #900317, even if the actual issue was first filed as #900613 against the xserver-xorg-video-fbdev package (also affecting its xserver-xorg-video-fbdev-udeb counterpart).

Having very responsive X people helped Cyril realize the patch was easy to grab and test, and that finally concluded the treasure hunt to repair the graphical installer! Two weeks after that, the Debian Installer Buster Alpha 3 release was published.