Ce billet n’a pas encore été traduit en français. La version anglaise est disponible ci-dessous.
What was noticed?
A few weeks ago, other developers reported on
the #debian-boot
IRC channel that they were seeing
regressions while testing daily builds of the graphical version of
the Debian Installer. This could easily be reproduced by anyone,
fetching the latest netboot/gtk/mini.iso
file for the
amd64
architecture, which can be found in the
directory
of daily
builds for amd64.
Timing-wise, it might appear that coincided with the move of all
our git repositories from the good,
old alioth.debian.org
hosting service to the brand
new salsa.debian.org
, the Gitlab instance replacing
the previous Fusionforge one. But that transition had nothing to
do with our issue, which is rather related to our move from the
4.15 kernel series to the 4.16 one.
First investigation
Looking at that black screen, it looked like the X server was
starting just fine, but nothing was being drawn, and input devices
(keyboard, mouse) appeared inactive. Thankfully, the QEMU/KVM
monitor makes it possible to trigger a switch to a different TTY,
to access log files and run a few commands. It was possible to
extract kernel logs and X logs to compare what was happening with
different kernel versions. The big surprise was: the logs were
identical, except for the uname
output and
the timestamps embedded in the logs. Switching back to the TTY
where X was expected to show up, here it was, properly started and
displaying the usual language selection screen! Moving from X logs
to system logs (including kernel logs):
May 12 02:25:02 kernel: [ 3.302164] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302291] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302351] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302450] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302530] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302682] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302804] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.302963] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.303009] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:02 kernel: [ 3.303078] random: debconf: uninitialized urandom read (8 bytes read) May 12 02:25:19 init: starting pid 170, tty '/dev/tty2': '-/bin/sh' May 12 02:25:21 debconf: Setting debconf/language to en May 12 02:25:21 main-menu[229]: DEBUG: resolver (libgcc1): package doesn't exist (ignored) May 12 02:25:21 main-menu[229]: INFO: Falling back to the package description for brltty-udeb May 12 02:25:21 main-menu[229]: INFO: Menu item 'localechooser' selected May 12 02:25:22 debconf: Setting debconf/language to en May 12 02:25:22 gtk-set-font: Switching to font 'DejaVu Sans' for 'en' May 12 02:25:24 kernel: [ 25.303708] random: crng init done
So it appears there were randomness/entropy-related issues very
soon after boot-up; then, roughly 20 seconds
later, debconf
(which drives the installation
process, asking questions and waiting for answers) was properly
started; closely followed by the kernel reporting some “random”
initialization had finished. This rang a bell so
I’ve reported my findings
as #898468, putting the Debian kernel team in copy.
Ben Hutchings swiftly
replied with some more information about the possible
culprit. Traditionally, X clients and X servers share an
authentication cookie (see the XAUTHORITY
environment variable, which can point to
e.g. ~/.Xauthority
, or the xauth
command line tool), called MIT-MAGIC-COOKIE-1
,
which requires some bits of randomness. It had been reported
that some desktop session programs rely on libICE
to handle that cookie, which uses a function from
the libbsd
library,
arc4random_buf()
, which in turn uses the
getrandom()
system call exposed by the kernel.
That getrandom()
syscall was recently modified as
part of a security fix, and this made callers block more easily
when not enough entropy has been gathered. A parameter/an option
can be specified to make this call non-blocking, but it would then
return an error immediately instead of blocking. More information
about this can be found in
the Fixing
Linux getrandom() in stable thread on
the debian-release@
mailing list, but basically the
idea was to get libbsd
fixed
in unstable/testing
rather than simply reverting the
security fix (the latter was preferred as a follow-up to the
security updates for stable
which were triggering severe regressions).
The installer team hadn’t checked that libICE
was
actually used by the installer, but waiting for
the libbsd
fix to rebuild and recheck was deemed like
a sensible course of action. Unfortunately, right after
a first
bug fix upload was accepted into unstable
, no
changes could be noticed in a freshly rebuilt installer…
Deeper investigations
A quick reality check revealed that the absence of changes was to
be expected, as the graphical installer doesn’t use
the libICE
library, which doesn’t even ship any
udebs…
Back to square one/screen black: debconf
was being
reported as the process affected by the blocking behaviour of
the getrandom()
call so the first step was to chase
down randomness-related calls in its source code. Unfortunately
nothing turned up…
Next try was diving into the gtk+2.0
source code,
since that’s the graphical toolkit used by the graphical
installer. That looked promising at first since there were a few
calls to g_random_int()
to initialize the
internal stamp
field for some data structures. That
function is defined in the glib1.0
package
(see glib/grand.c
), and trying to get rid of the
calls from gtk+2.0
made no differences at all…
At this point, it wasn’t clear which part was responsible for the
calls to getrandom()
, so Cyril went for a brute force
approach. A new custom build of the graphical installer was
prepared, including the strace
tool (which traces
syscalls); and modifying the start-up scripts to
wrap debconf
’s execution
with strace
. The rootskel
package ships
this line in /lib/debian-installer/menu
:
exec debconf -o d-i $MENU
which was modified by a sed
call in the main Makefile
of the debian-installer
package
(build/Makefile
):
sed 's,exec debconf,exec strace -v -f -s 400 -o /tmp/debconf.strace debconf,' -i "$(TREE)/lib/debian-installer/menu"
This resulted in strace
writing
a /tmp/debconf.strace
file which could then be used
to debug this further. Since the Debian Installer comes with
little debugging tools, this file can be extracted using a simple
network pipe, with nc
/netcat
.
A new hope
Searching
for getrandom()
calls, they were apparently appearing
right after opening the /etc/fonts/fonts.conf
file,
but also while opening other files under
the /etc/fonts/fonts.d/
directory, which contains
configuration snippets for fontconfig
. Searching its source code, it appeared two files could be relevant:
src/fccache.c
which callsuuid_generate_random()
;src/fccompat.c
defines anFcRandom()
function which leveragesrandom_r()
if detected through theconfigure
script (HAVE_RANDOM_R
).
Adding a few fprintf(stderr, "...\n");
statements
around all suspicious calls made it clear:
the uuid_generate_random()
calls (implemented
in libuuid1
/libuuid1-udeb
, packages
built from the util-linux
source) are responsible for
the delays. Let’s look at the implementation
in lib/randutils.c
, the UUID generation
calling get_random_bytes()
in the same file:
/*
* Generate a stream of random nbytes into buf.
* Use /dev/urandom if possible, and if not,
* use glibc pseudo-random functions.
*/
#define UL_RAND_READ_ATTEMPTS 8
#define UL_RAND_READ_DELAY 125000 /* microseconds */
void random_get_bytes(void *buf, size_t nbytes)
{
unsigned char *cp = (unsigned char *)buf;
size_t i, n = nbytes;
int lose_counter = 0;
#ifdef HAVE_GETRANDOM
while (n > 0) {
int x;
errno = 0;
x = getrandom(cp, n, GRND_NONBLOCK);
if (x > 0) { /* success */
n -= x;
cp += x;
lose_counter = 0;
} else if (errno == ENOSYS) { /* kernel without getrandom() */
break;
} else if (errno == EAGAIN && lose_counter < UL_RAND_READ_ATTEMPTS) {
xusleep(UL_RAND_READ_DELAY); /* no etropy, wait and try again */
lose_counter++;
} else
break;
}
if (errno == ENOSYS)
#endif
/*
* We've been built against headers that support getrandom, but the
* running kernel does not. Fallback to reading from /dev/{u,}random
* as before
*/
{
int fd = random_get_fd();
/* … */
So if the HAVE_GETRANDOM
macro was defined at
compile-time, a getrandom()
call is attempted, in
non-blocking mode. If the ENOSYS
error is triggered,
the break;
gets us out of the loop, and the code
carries one with a fallback to reading
from /dev/urandom
or /dev/random
(one
gets the relevant file descriptor
with random_get_fd()
as seen on the last line.
But if the error is EAGAIN
instead, that means not
enough entropy was available. In this case, there’s a little delay
getting added through xusleep()
, and further attempts
are made until a given limit (UL_RAND_READ_ATTEMPTS
)
is reached. At this point, the final break;
triggers
an exit of the loop and the fallback described above is reached.
Where to go from here?
To make extra sure diagnostics were correct, a little patch was
developed as a proof of concept, disabling
the HAVE_GETRANDOM
macro, so that
the while
loop quoted above isn’t even tried, and so
that one jumps directly to the fallback reading
from /dev/*random
. It was a bit tricky to get this
part right since there are several places where it can be enabled:
a configure.ac
check, another check on
the __NR_getrandom
macro, and a final check on
the SYS_getrandom
macro.
--- a/configure.ac
+++ b/configure.ac
@@ -462,7 +462,6 @@ AC_CHECK_FUNCS([ \\
getdtablesize \\
getexecname \\
getmntinfo \\
- getrandom \\
getrlimit \\
getsgnam \\
inotify_init \\
--- a/lib/randutils.c
+++ b/lib/randutils.c
@@ -27,14 +27,14 @@
#ifdef HAVE_GETRANDOM
# include <sys/random.h>
-#elif defined (__linux__)
+#elif 0
# if !defined(SYS_getrandom) && defined(__NR_getrandom)
/* usable kernel-headers, but old glibc-headers */
# define SYS_getrandom __NR_getrandom
# endif
#endif
-#if !defined(HAVE_GETRANDOM) && defined(SYS_getrandom)
+#if 0
/* libc without function, but we have syscal */
#define GRND_NONBLOCK 0x01
#define GRND_RANDOM 0x02
This patch ensures a direct read from /dev/*random
,
which means delays go away, and that Debian Installer starts
properly. Hurray?
Unfortunately, it doesn’t seem reasonable to degrade the quality
of randomness used by a function
called uuid_generate_random()
, since it might be used
by something a little more critical than setting UUIDs for fonts in fontconfig
;
a quick search
on sources.debian.org
returns 58 packages with that name in their source code at the
very least. Increasing this function’s complexity to add an I
don’t care about entropy quality
option doesn’t seem too good
either. So maybe it might make sense to see
if fontconfig
might be tweaked to use something that
wouldn’t rely on getrandom()
…
Time to report these findings to the Debian Installer team as a
whole, adding fontconfig
and util-linux
developers to the loop. Let’s see what consensus we can reach.
Update (June 2018)
Ben Hutchings saved the day one more time, checking my status report and quickly pointing out that the same issue was encountered with plymouth as well. That component also runs very early in the boot process, to provide with a graphical “splash” screen.
Borrowing the fix from plymouth was very easy, as it was only about
creating
the fontconfig cache at build-time thanks to
an fc-cache
call; and that finally made this issue
vanish!
There was still a little issue left to fix, as the X server had
been updated in the meanwhile: a transition to a major version of
the X stack had started in unstable
in the meanwhile,
and there was a regression with the fbdev
driver. The
server would fail to start with:
Xorg: symbol lookup error: /usr/lib/xorg/modules/drivers/fbdev_drv.so: undefined symbol: shadowUpdatePackedWeak
The Debian Installer side was tracked
in #900317, even if
the actual issue was first filed
as #900613 against
the xserver-xorg-video-fbdev
package (also affecting
its xserver-xorg-video-fbdev-udeb
counterpart).
Having very responsive X people helped Cyril realize the patch was easy to grab and test, and that finally concluded the treasure hunt to repair the graphical installer! Two weeks after that, the Debian Installer Buster Alpha 3 release was published.