A few weeks ago, other developers reported on the #debian-boot
IRC channel that they were seeing
regressions while testing daily builds of the graphical version of the Debian Installer. This could easily be
reproduced by anyone, fetching the latest netboot/gtk/mini.iso
file for the amd64
architecture, which can be found in the directory of daily
builds for amd64.
Timing-wise, it might appear that coincided with the move of all our git repositories from the good, old
alioth.debian.org
hosting service to the brand new salsa.debian.org
, the Gitlab
instance replacing the previous Fusionforge one. But that transition had nothing to do with our issue, which is
rather related to our move from the 4.15 kernel series to the 4.16 one.
Looking at that black screen, it looked like the X server was starting just fine, but nothing was being drawn,
and input devices (keyboard, mouse) appeared inactive. Thankfully, the QEMU/KVM monitor makes it possible to
trigger a switch to a different TTY, to access log files and run a few commands. It was possible to extract
kernel logs and X logs to compare what was happening with different kernel versions. The big surprise was: the
logs were identical, except for the uname
output and the timestamps embedded in the logs. Switching
back to the TTY where X was expected to show up, here it was, properly started and displaying the usual language
selection screen! Moving from X logs to system logs (including kernel logs):
May 12 02:25:02 kernel: [ 3.302164] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302291] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302351] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302450] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302530] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302682] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302804] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302963] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.303009] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.303078] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:19 init: starting pid 170, tty '/dev/tty2': '-/bin/sh' May 12 02:25:21 debconf: Setting debconf/language to en May 12 02:25:21 main-menu[229]: DEBUG: resolver (libgcc1): package doesn't exist (ignored) May 12 02:25:21 main-menu[229]: INFO: Falling back to the package description for brltty-udeb May 12 02:25:21 main-menu[229]: INFO: Menu item 'localechooser' selected May 12 02:25:22 debconf: Setting debconf/language to en May 12 02:25:22 gtk-set-font: Switching to font 'DejaVu Sans' for 'en' May 12 02:25:24 kernel: [ 25.303708] random: crng init done
So it appears there were randomness/entropy-related issues very soon after boot-up; then, roughly 20 seconds
later, debconf
(which drives the installation process, asking questions and waiting for answers) was
properly started; closely followed by the kernel reporting some “random” initialization had finished. This rang a
bell so I’ve reported my findings as #898468, putting the Debian
kernel team in copy.
Ben Hutchings swiftly replied with some more information about
the possible culprit. Traditionally, X clients and X servers share an authentication cookie (see the
XAUTHORITY
environment variable, which can point to e.g. ~/.Xauthority
, or the
xauth
command line tool), called MIT-MAGIC-COOKIE-1
, which requires some bits of
randomness. It had been reported that some desktop session programs rely on libICE
to handle that
cookie, which uses a function from the libbsd
library, arc4random_buf()
, which in turn
uses the getrandom()
system call exposed by the kernel.
That getrandom()
syscall was recently modified as part of a security fix, and this made callers
block more easily when not enough entropy has been gathered. A parameter/an option can be specified to make this
call non-blocking, but it would then return an error immediately instead of blocking. More information about this
can be found in the Fixing Linux
getrandom() in stable thread on the debian-release@
mailing list, but basically the idea was to
get libbsd
fixed in unstable/testing
rather than simply reverting the security fix (the
latter was preferred as a follow-up to the security updates for stable
which were triggering severe
regressions).
The installer team hadn’t checked that libICE
was actually used by the installer, but waiting for
the libbsd
fix to rebuild and recheck was deemed like a sensible course of action. Unfortunately,
right after a first
bug fix upload was accepted into unstable
, no changes could be noticed in a freshly rebuilt
installer…
A quick reality check revealed that the absence of changes was to be expected, as the graphical installer
doesn’t use the libICE
library, which doesn’t even ship any udebs…
Back to square one/screen black: debconf
was being reported as the process affected by the
blocking behaviour of the getrandom()
call so the first step was to chase down randomness-related
calls in its source code. Unfortunately nothing turned up…
Next try was diving into the gtk+2.0
source code, since that’s the graphical toolkit used by the
graphical installer. That looked promising at first since there were a few calls to g_random_int()
to initialize the internal stamp
field for some data structures. That function is defined in the
glib1.0
package (see glib/grand.c
), and trying to get rid of the calls from
gtk+2.0
made no differences at all…
At this point, it wasn’t clear which part was responsible for the calls to getrandom()
, so Cyril
went for a brute force approach. A new custom build of the graphical installer was prepared, including the
strace
tool (which traces syscalls); and modifying the start-up scripts to wrap
debconf
’s execution with strace
. The rootskel
package ships this line in
/lib/debian-installer/menu
:
exec debconf -o d-i $MENU
which was modified by a sed
call in the main Makefile of the debian-installer
package (build/Makefile
):
sed 's,exec debconf,exec strace -v -f -s 400 -o /tmp/debconf.strace debconf,' -i "/lib/debian-installer/menu"
This resulted in strace
writing a /tmp/debconf.strace
file which could then be used
to debug this further. Since the Debian Installer comes with little debugging tools, this file can be extracted
using a simple network pipe, with nc
/netcat
.
Searching for getrandom()
calls, they were apparently appearing right after opening the
/etc/fonts/fonts.conf
file, but also while opening other files under the
/etc/fonts/fonts.d/
directory, which contains configuration snippets for fontconfig
.
Searching its source code, it appeared two files could be relevant:
src/fccache.c
which calls uuid_generate_random()
;src/fccompat.c
defines an FcRandom()
function which leverages
random_r()
if detected through the configure
script
(HAVE_RANDOM_R
).Adding a few fprintf(stderr, "...\n");
statements around all suspicious calls made it clear: the
uuid_generate_random()
calls (implemented in libuuid1
/libuuid1-udeb
,
packages built from the util-linux
source) are responsible for the delays. Let’s look at the
implementation in lib/randutils.c
, the UUID generation calling get_random_bytes()
in
the same file:
/*
* Generate a stream of random nbytes into buf.
* Use /dev/urandom if possible, and if not,
* use glibc pseudo-random functions.
*/
#define UL_RAND_READ_ATTEMPTS 8
#define UL_RAND_READ_DELAY 125000 /* microseconds */
void random_get_bytes(void *buf, size_t nbytes)
{
unsigned char *cp = (unsigned char *)buf;
size_t i, n = nbytes;
int lose_counter = 0;
#ifdef HAVE_GETRANDOM
while (n > 0) {
int x;
errno = 0;
x = getrandom(cp, n, GRND_NONBLOCK);
if (x > 0) { /* success */
n -= x;
cp += x;
lose_counter = 0;
} else if (errno == ENOSYS) { /* kernel without getrandom() */
break;
} else if (errno == EAGAIN && lose_counter < UL_RAND_READ_ATTEMPTS) {
xusleep(UL_RAND_READ_DELAY); /* no etropy, wait and try again */
lose_counter++;
} else
break;
}
if (errno == ENOSYS)
#endif
/*
* We've been built against headers that support getrandom, but the
* running kernel does not. Fallback to reading from /dev/{u,}random
* as before
*/
{
int fd = random_get_fd();
/* … */
So if the HAVE_GETRANDOM
macro was defined at compile-time, a getrandom()
call is
attempted, in non-blocking mode. If the ENOSYS
error is triggered, the break;
gets us
out of the loop, and the code carries one with a fallback to reading from /dev/urandom
or
/dev/random
(one gets the relevant file descriptor with random_get_fd()
as seen on the
last line.
But if the error is EAGAIN
instead, that means not enough entropy was available. In this case,
there’s a little delay getting added through xusleep()
, and further attempts are made until a given
limit (UL_RAND_READ_ATTEMPTS
) is reached. At this point, the final break;
triggers an
exit of the loop and the fallback described above is reached.
To make extra sure diagnostics were correct, a little patch was developed as a proof of concept, disabling the
HAVE_GETRANDOM
macro, so that the while
loop quoted above isn’t even tried, and so that
one jumps directly to the fallback reading from /dev/*random
. It was a bit tricky to get this part
right since there are several places where it can be enabled: a configure.ac
check, another check on
the __NR_getrandom
macro, and a final check on the SYS_getrandom
macro.
--- a/configure.ac
+++ b/configure.ac
@@ -462,7 +462,6 @@ AC_CHECK_FUNCS([ \\
getdtablesize \\
getexecname \\
getmntinfo \\
- getrandom \\
getrlimit \\
getsgnam \\
inotify_init \\
--- a/lib/randutils.c
+++ b/lib/randutils.c
@@ -27,14 +27,14 @@
-#elif defined (__linux__)
+#elif 0
/* usable kernel-headers, but old glibc-headers */
-#if !defined(HAVE_GETRANDOM) && defined(SYS_getrandom)
+#if 0
/* libc without function, but we have syscal */
This patch ensures a direct read from /dev/*random
, which means delays go away, and that Debian
Installer starts properly. Hurray?
Unfortunately, it doesn’t seem reasonable to degrade the quality of randomness used by a function called
uuid_generate_random()
, since it might be used by something a little more critical than setting
UUIDs for fonts in fontconfig
; a quick search on sources.debian.org returns 58 packages with that name in their source code at
the very least. Increasing this function’s complexity to add an I don’t care about entropy quality
option
doesn’t seem too good either. So maybe it might make sense to see if fontconfig
might be tweaked to
use something that wouldn’t rely on getrandom()
…
Time to report these findings to the Debian Installer team as a whole, adding fontconfig
and
util-linux
developers to the loop. Let’s see what consensus we can reach.
Ben Hutchings saved the day one more time, checking my status report and quickly pointing out that the same issue was encountered with plymouth as well. That component also runs very early in the boot process, to provide with a graphical “splash” screen.
Borrowing the fix from plymouth was very easy, as it was only about creating
the fontconfig cache at build-time thanks to an fc-cache
call; and that finally made this issue
vanish!
There was still a little issue left to fix, as the X server had been updated in the meanwhile: a transition to
a major version of the X stack had started in unstable
in the meanwhile, and there was a regression
with the fbdev
driver. The server would fail to start with:
Xorg: symbol lookup error: /usr/lib/xorg/modules/drivers/fbdev_drv.so: undefined symbol: shadowUpdatePackedWeak
The Debian Installer side was tracked in #900317, even if the
actual issue was first filed as #900613 against the
xserver-xorg-video-fbdev
package (also affecting its xserver-xorg-video-fbdev-udeb
counterpart).
Having very responsive X people helped Cyril realize the patch was easy to grab and test, and that finally concluded the treasure hunt to repair the graphical installer! Two weeks after that, the Debian Installer Buster Alpha 3 release was published.