What was noticed?
A few weeks ago, other developers reported on the #debian-boot
IRC channel that they were seeing regressions while testing daily builds of the graphical version of the Debian Installer. This could easily be reproduced by anyone, fetching the latest netboot/gtk/mini.iso
file for the amd64
architecture, which can be found in the directory of daily builds for amd64.
Timing-wise, it might appear that coincided with the move of all our git repositories from the good, old alioth.debian.org
hosting service to the brand new salsa.debian.org
, the Gitlab instance replacing the previous Fusionforge one. But that transition had nothing to do with our issue, which is rather related to our move from the 4.15 kernel series to the 4.16 one.
First investigation
Looking at that black screen, it looked like the X server was starting just fine, but nothing was being drawn, and input devices (keyboard, mouse) appeared inactive. Thankfully, the QEMU/KVM monitor makes it possible to trigger a switch to a different TTY, to access log files and run a few commands. It was possible to extract kernel logs and X logs to compare what was happening with different kernel versions. The big surprise was: the logs were identical, except for the uname
output and the timestamps embedded in the logs. Switching back to the TTY where X was expected to show up, here it was, properly started and displaying the usual language selection screen! Moving from X logs to system logs (including kernel logs):
May 12 02:25:02 kernel: [ 3.302164] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302291] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302351] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302450] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302530] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302682] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302804] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302963] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.303009] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.303078] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:19 init: starting pid 170, tty '/dev/tty2': '-/bin/sh' May 12 02:25:21 debconf: Setting debconf/language to en May 12 02:25:21 main-menu[229]: DEBUG: resolver (libgcc1): package doesn't exist (ignored) May 12 02:25:21 main-menu[229]: INFO: Falling back to the package description for brltty-udeb May 12 02:25:21 main-menu[229]: INFO: Menu item 'localechooser' selected May 12 02:25:22 debconf: Setting debconf/language to en May 12 02:25:22 gtk-set-font: Switching to font 'DejaVu Sans' for 'en' May 12 02:25:24 kernel: [ 25.303708] random: crng init done
So it appears there were randomness/entropy-related issues very soon after boot-up; then, roughly 20 seconds later, debconf
(which drives the installation process, asking questions and waiting for answers) was properly started; closely followed by the kernel reporting some “random” initialization had finished. This rang a bell so I’ve reported my findings as #898468, putting the Debian kernel team in copy.
Ben Hutchings swiftly replied with some more information about the possible culprit. Traditionally, X clients and X servers share an authentication cookie (see the XAUTHORITY
environment variable, which can point to e.g. ~/.Xauthority
, or the xauth
command line tool), called MIT-MAGIC-COOKIE-1
, which requires some bits of randomness. It had been reported that some desktop session programs rely on libICE
to handle that cookie, which uses a function from the libbsd
library, arc4random_buf()
, which in turn uses the getrandom()
system call exposed by the kernel.
That getrandom()
syscall was recently modified as part of a security fix, and this made callers block more easily when not enough entropy has been gathered. A parameter/an option can be specified to make this call non-blocking, but it would then return an error immediately instead of blocking. More information about this can be found in the Fixing Linux getrandom() in stable thread on the debian-release@
mailing list, but basically the idea was to get libbsd
fixed in unstable/testing
rather than simply reverting the security fix (the latter was preferred as a follow-up to the security updates for stable
which were triggering severe regressions).
The installer team hadn’t checked that libICE
was actually used by the installer, but waiting for the libbsd
fix to rebuild and recheck was deemed like a sensible course of action. Unfortunately, right after a first bug fix upload was accepted into unstable
, no changes could be noticed in a freshly rebuilt installer…
Deeper investigations
A quick reality check revealed that the absence of changes was to be expected, as the graphical installer doesn’t use the libICE
library, which doesn’t even ship any udebs…
Back to square one/screen black: debconf
was being reported as the process affected by the blocking behaviour of the getrandom()
call so the first step was to chase down randomness-related calls in its source code. Unfortunately nothing turned up…
Next try was diving into the gtk+2.0
source code, since that’s the graphical toolkit used by the graphical installer. That looked promising at first since there were a few calls to g_random_int()
to initialize the internal stamp
field for some data structures. That function is defined in the glib1.0
package (see glib/grand.c
), and trying to get rid of the calls from gtk+2.0
made no differences at all…
At this point, it wasn’t clear which part was responsible for the calls to getrandom()
, so Cyril went for a brute force approach. A new custom build of the graphical installer was prepared, including the strace
tool (which traces syscalls); and modifying the start-up scripts to wrap debconf
’s execution with strace
. The rootskel
package ships this line in /lib/debian-installer/menu
:
exec debconf -o d-i $MENU
which was modified by a sed
call in the main Makefile of the debian-installer
package (build/Makefile
):
sed 's,exec debconf,exec strace -v -f -s 400 -o /tmp/debconf.strace debconf,' -i "$(TREE)/lib/debian-installer/menu"
This resulted in strace
writing a /tmp/debconf.strace
file which could then be used to debug this further. Since the Debian Installer comes with little debugging tools, this file can be extracted using a simple network pipe, with nc
/netcat
.
A new hope
Searching for getrandom()
calls, they were apparently appearing right after opening the /etc/fonts/fonts.conf
file, but also while opening other files under the /etc/fonts/fonts.d/
directory, which contains configuration snippets for fontconfig
. Searching its source code, it appeared two files could be relevant:
src/fccache.c
which callsuuid_generate_random()
;src/fccompat.c
defines anFcRandom()
function which leveragesrandom_r()
if detected through theconfigure
script (HAVE_RANDOM_R
).
Adding a few fprintf(stderr, "...\n");
statements around all suspicious calls made it clear: the uuid_generate_random()
calls (implemented in libuuid1
/libuuid1-udeb
, packages built from the util-linux
source) are responsible for the delays. Let’s look at the implementation in lib/randutils.c
, the UUID generation calling get_random_bytes()
in the same file:
/*
* Generate a stream of random nbytes into buf.
* Use /dev/urandom if possible, and if not,
* use glibc pseudo-random functions.
*/
#define UL_RAND_READ_ATTEMPTS 8
#define UL_RAND_READ_DELAY 125000 /* microseconds */
void random_get_bytes(void *buf, size_t nbytes)
{
unsigned char *cp = (unsigned char *)buf;
size_t i, n = nbytes;
int lose_counter = 0;
#ifdef HAVE_GETRANDOM
while (n > 0) {
int x;
errno = 0;
x = getrandom(cp, n, GRND_NONBLOCK);
if (x > 0) { /* success */
n -= x;
cp += x;
lose_counter = 0;
} else if (errno == ENOSYS) { /* kernel without getrandom() */
break;
} else if (errno == EAGAIN && lose_counter < UL_RAND_READ_ATTEMPTS) {
xusleep(UL_RAND_READ_DELAY); /* no etropy, wait and try again */
lose_counter++;
} else
break;
}
if (errno == ENOSYS)
#endif
/*
* We've been built against headers that support getrandom, but the
* running kernel does not. Fallback to reading from /dev/{u,}random
* as before
*/
{
int fd = random_get_fd();
/* … */
So if the HAVE_GETRANDOM
macro was defined at compile-time, a getrandom()
call is attempted, in non-blocking mode. If the ENOSYS
error is triggered, the break;
gets us out of the loop, and the code carries one with a fallback to reading from /dev/urandom
or /dev/random
(one gets the relevant file descriptor with random_get_fd()
as seen on the last line.
But if the error is EAGAIN
instead, that means not enough entropy was available. In this case, there’s a little delay getting added through xusleep()
, and further attempts are made until a given limit (UL_RAND_READ_ATTEMPTS
) is reached. At this point, the final break;
triggers an exit of the loop and the fallback described above is reached.
Where to go from here?
To make extra sure diagnostics were correct, a little patch was developed as a proof of concept, disabling the HAVE_GETRANDOM
macro, so that the while
loop quoted above isn’t even tried, and so that one jumps directly to the fallback reading from /dev/*random
. It was a bit tricky to get this part right since there are several places where it can be enabled: a configure.ac
check, another check on the __NR_getrandom
macro, and a final check on the SYS_getrandom
macro.
--- a/configure.ac
+++ b/configure.ac
@@ -462,7 +462,6 @@ AC_CHECK_FUNCS([ \\
getdtablesize \\
getexecname \\
getmntinfo \\
- getrandom \\
getrlimit \\
getsgnam \\
inotify_init \\
--- a/lib/randutils.c
+++ b/lib/randutils.c
@@ -27,14 +27,14 @@
#ifdef HAVE_GETRANDOM
# include <sys/random.h>
-#elif defined (__linux__)
+#elif 0
# if !defined(SYS_getrandom) && defined(__NR_getrandom)
/* usable kernel-headers, but old glibc-headers */
# define SYS_getrandom __NR_getrandom
# endif
#endif
-#if !defined(HAVE_GETRANDOM) && defined(SYS_getrandom)
+#if 0
/* libc without function, but we have syscal */
#define GRND_NONBLOCK 0x01
#define GRND_RANDOM 0x02
This patch ensures a direct read from /dev/*random
, which means delays go away, and that Debian Installer starts properly. Hurray?
Unfortunately, it doesn’t seem reasonable to degrade the quality of randomness used by a function called uuid_generate_random()
, since it might be used by something a little more critical than setting UUIDs for fonts in fontconfig
; a quick search on sources.debian.org returns 58 packages with that name in their source code at the very least. Increasing this function’s complexity to add an I don’t care about entropy quality
option doesn’t seem too good either. So maybe it might make sense to see if fontconfig
might be tweaked to use something that wouldn’t rely on getrandom()
…
Time to report these findings to the Debian Installer team as a whole, adding fontconfig
and util-linux
developers to the loop. Let’s see what consensus we can reach.
Update (June 2018)
Ben Hutchings saved the day one more time, checking my status report and quickly pointing out that the same issue was encountered with plymouth as well. That component also runs very early in the boot process, to provide with a graphical “splash” screen.
Borrowing the fix from plymouth was very easy, as it was only about creating the fontconfig cache at build-time thanks to an fc-cache
call; and that finally made this issue vanish!
There was still a little issue left to fix, as the X server had been updated in the meanwhile: a transition to a major version of the X stack had started in unstable
in the meanwhile, and there was a regression with the fbdev
driver. The server would fail to start with:
Xorg: symbol lookup error: /usr/lib/xorg/modules/drivers/fbdev_drv.so: undefined symbol: shadowUpdatePackedWeak
The Debian Installer side was tracked in #900317, even if the actual issue was first filed as #900613 against the xserver-xorg-video-fbdev
package (also affecting its xserver-xorg-video-fbdev-udeb
counterpart).
Having very responsive X people helped Cyril realize the patch was easy to grab and test, and that finally concluded the treasure hunt to repair the graphical installer! Two weeks after that, the Debian Installer Buster Alpha 3 release was published.