
systemd: RequiredBy versus WantedBy

systemd metadata

Introduction

Debamax helps several customers build and maintain system images based on Debian: these are deployed on target devices (regular servers, workstations, embedded devices), then upgraded to new versions of the custom operating system over secure channels, following validated upgrade paths.

Such systems usually require some system-level integration, making sure all required packages work well when installed together, along with a configuration daemon to ensure those systems can be tweaked as required. Such configuration daemons can be controlled remotely through a so-called “business application” that takes care of establishing a link to some remote backoffice, or exposed on a local console for operators to configure the system in a controlled manner (as opposed to exposing a full root shell).

This article focuses on the implementation details of such a configuration daemon, which is split into two parts: the daemon itself, which exposes a REST interface used by the business application to trigger configuration updates, and a separate part which handles upgrades. To abstract this article from internal names, those two parts are called debamax-daemon and debamax-upgrade respectively. The following graph highlights the possible interactions.

Interactions between daemon and upgrade components
Interactions between units

Each part is managed by a systemd service unit: debamax-daemon.service for the daemon, and debamax-upgrade.service for the upgrade part.

Through some heavily simplified code, this article highlights the need for accurate metadata in those two systemd units, namely RequiredBy= versus WantedBy= in the [Install] section.

The playground is a minimal Debian 10 (Buster) virtual machine. To simplify tests, no actual debamax-* Debian packages are involved (even though that's the case for customer systems), and directories for local administrators are used instead of system-wide directories (which are for proper packages). This means systemd service units live in /etc/systemd/system/ (rather than /lib/systemd/system/), and scripts in /usr/local/sbin/ (rather than /usr/sbin/).

Important: In the whole article, the following convention will be used: when quoting commands being run, a line with 4 hyphens (----) will separate the output of the command from the system logs (as seen through journalctl). Letting a journalctl -f run in the background or in a different console makes it easy to follow what's happening.

Looking at the upgrade script

Naive attempt

When starting from scratch, rather than patching an existing systemd unit, it might seem appealing to look around in all /lib/systemd/system/*.service, find something that resembles what we'd like to achieve, and adapt it. A minimal systemd service unit (/etc/systemd/system/debamax-upgrade.service) might look like this:

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/debamax-upgrade

[Install]
RequiredBy=multi-user.target

and a minimal upgrade script (/usr/local/sbin/debamax-upgrade, don't forget the +x flag) might be:

#!/bin/sh
echo "starting debamax-upgrade"
echo "stopping debamax-upgrade"

Let's enable it with --now, meaning enable and start in a single command:

root@demo:~# systemctl enable --now debamax-upgrade
Created symlink /etc/systemd/system/multi-user.target.requires/debamax-upgrade.service \
  → /etc/systemd/system/debamax-upgrade.service.
----
Jun 13 17:01:16 demo systemd[1]: Reloading.
Jun 13 17:01:16 demo systemd[1]: Starting debamax-upgrade.service...
Jun 13 17:01:16 demo debamax-upgrade[881]: starting debamax-upgrade
Jun 13 17:01:16 demo debamax-upgrade[881]: stopping debamax-upgrade
Jun 13 17:01:16 demo systemd[1]: debamax-upgrade.service: Succeeded.
Jun 13 17:01:16 demo systemd[1]: Started debamax-upgrade.service.

The RequiredBy=multi-user.target setting is translated into a symlink pointing at the actual unit file, systemd reloads its configuration on its own (no need for a separate systemctl daemon-reload), and the unit is started.

All good? Not quite!

Actual testing

Now we have a debamax-upgrade script that was started once, and that will start at boot-up, which is rather good. But what if the daemon needs to request an upgrade? Restarting the debamax-upgrade.service unit seems the right thing to do:

root@demo:~# systemctl restart debamax-upgrade
----
Jun 13 17:04:05 demo systemd[1]: Stopped target Graphical Interface.
Jun 13 17:04:05 demo systemd[1]: Stopping Graphical Interface.
Jun 13 17:04:05 demo systemd[1]: Stopped target Multi-User System.
Jun 13 17:04:05 demo systemd[1]: Stopping Multi-User System.
Jun 13 17:04:05 demo systemd[1]: Starting debamax-upgrade.service...
Jun 13 17:04:05 demo debamax-upgrade[892]: starting debamax-upgrade
Jun 13 17:04:05 demo debamax-upgrade[892]: stopping debamax-upgrade
Jun 13 17:04:05 demo systemd[1]: debamax-upgrade.service: Succeeded.
Jun 13 17:04:05 demo systemd[1]: Started debamax-upgrade.service.
Jun 13 17:04:05 demo systemd[1]: Reached target Multi-User System.
Jun 13 17:04:05 demo systemd[1]: Reached target Graphical Interface.
Jun 13 17:04:05 demo systemd[1]: Starting Update UTMP about System Runlevel Changes...
Jun 13 17:04:05 demo systemd[1]: systemd-update-utmp-runlevel.service: Succeeded.
Jun 13 17:04:05 demo systemd[1]: Started Update UTMP about System Runlevel Changes.

Wait a minute! What's with those Graphical and Multi-User targets? Even though no display manager is installed, graphical.target is the default target, and it depends on multi-user.target, as can be seen in the bootup manpage. But why are those two targets being stopped?

A restart was requested for a service unit that multi-user.target requires, so while that unit is stopped, the target's requirements are no longer met, and it gets stopped along with graphical.target. Once the service unit starts again, those targets can be re-entered, with some visible side effects (UTMP, System Runlevel Changes).

Since this is a shell script that runs once and exits, as declared through Type=oneshot, one might wonder what happens if a plain start were requested instead:

root@demo:~# systemctl start debamax-upgrade
----
Jun 13 17:05:09 demo systemd[1]: Starting debamax-upgrade.service...
Jun 13 17:05:09 demo debamax-upgrade[899]: starting debamax-upgrade
Jun 13 17:05:09 demo debamax-upgrade[899]: stopping debamax-upgrade
Jun 13 17:05:09 demo systemd[1]: debamax-upgrade.service: Succeeded.
Jun 13 17:05:09 demo systemd[1]: Started debamax-upgrade.service.

That looks way better, and closer to what was intended. To be entirely honest, an early “fix” was to just spawn systemctl start debamax-upgrade from the debamax-daemon code when requesting an upgrade, but that looked fishy. Can we fix the restart case that looked awkward?

Towards a real fix

Let's switch the systemd service unit from RequiredBy= to WantedBy=. Beware, while doing so, it's best to first disable the current unit, edit it, and then enable it again, so that the proper symlinks can be removed and created:

root@demo:~# systemctl disable --now debamax-upgrade
Removed /etc/systemd/system/multi-user.target.requires/debamax-upgrade.service.

root@demo:~# sed 's/RequiredBy=/WantedBy=/' -i /etc/systemd/system/debamax-upgrade.service

root@demo:~# systemctl enable --now debamax-upgrade
Created symlink /etc/systemd/system/multi-user.target.wants/debamax-upgrade.service \
  → /etc/systemd/system/debamax-upgrade.service.
----
Jun 13 17:06:12 demo systemd[1]: Reloading.
…
Jun 13 17:06:42 demo systemd[1]: Reloading.
Jun 13 17:06:43 demo systemd[1]: Starting debamax-upgrade.service...
Jun 13 17:06:43 demo debamax-upgrade[936]: starting debamax-upgrade
Jun 13 17:06:43 demo debamax-upgrade[936]: stopping debamax-upgrade
Jun 13 17:06:43 demo systemd[1]: debamax-upgrade.service: Succeeded.
Jun 13 17:06:43 demo systemd[1]: Started debamax-upgrade.service.

For the avoidance of doubt, here's the new version of the systemd service unit for the upgrade component (/etc/systemd/system/debamax-upgrade.service):

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/debamax-upgrade

[Install]
WantedBy=multi-user.target

Let's compare starting:

root@demo:~# systemctl start debamax-upgrade
----
Jun 13 17:07:48 demo systemd[1]: Starting debamax-upgrade.service...
Jun 13 17:07:48 demo debamax-upgrade[941]: starting debamax-upgrade
Jun 13 17:07:48 demo debamax-upgrade[941]: stopping debamax-upgrade
Jun 13 17:07:48 demo systemd[1]: debamax-upgrade.service: Succeeded.
Jun 13 17:07:48 demo systemd[1]: Started debamax-upgrade.service.

with restarting:

root@demo:~# systemctl restart debamax-upgrade
----
Jun 13 17:08:16 demo systemd[1]: Starting debamax-upgrade.service...
Jun 13 17:08:16 demo debamax-upgrade[944]: starting debamax-upgrade
Jun 13 17:08:16 demo debamax-upgrade[944]: stopping debamax-upgrade
Jun 13 17:08:16 demo systemd[1]: debamax-upgrade.service: Succeeded.
Jun 13 17:08:16 demo systemd[1]: Started debamax-upgrade.service.

OK, this is way better!

Looking at the daemon

More context

Let's backpedal a bit: it was confessed earlier that an early “fix” was just to switch from restart to start for the debamax-upgrade.service unit. It was noted in a ticket, “for later”, that strange interactions across targets would otherwise happen, but an easy fix was available. Why was it so important to get back to this topic and find a proper fix?

The problem was spotted again in a different area… This custom operating system features three critical components: the configuration daemon, a business application, and a support application.

The internal names for the business and support applications have been replaced with demo-service-a and demo-service-b in this article.

To free up some resources during upgrades, and to avoid possible ups and downs as those applications get restarted, it was decided to simply shut them down prior to upgrades. A successful upgrade would trigger a reboot and let the usual boot-up sequence start everything again, while a failed upgrade would lead to restarting services manually.

Naive attempt

Let's see what a minimal systemd unit configuration could look like for the configuration daemon and for the two demo services.

Let's keep all three daemons very minimal: they wait a full day for something to happen, ensuring they keep running for a while. The aim is to simulate long-running daemons (Type=simple) with trivial code, rather than a single shell script that exits early (which would usually be declared as Type=oneshot).
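The unit files themselves aren't shown here; the following is a plausible reconstruction of the naive /etc/systemd/system/debamax-daemon.service, inferred from the symlink removed during the disable step and from the fixed version shown later in this article (only the [Install] entry differs). The two demo services follow the same pattern, each with its own ExecStart= line:

```
[Service]
Type=simple
ExecStart=/usr/local/sbin/debamax-daemon

[Install]
RequiredBy=multi-user.target
```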

Now, let's check a new version of the upgrade script, which implements the policy described above: stop the demo services first, then trigger an upgrade. To simplify things further, the upgrade is simulated as well. What would normally happen when a package containing a daemon is upgraded is that new files are put in place and the daemon is restarted; here, the upgrade simulation is only about restarting the debamax-daemon service unit.
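Here's a hedged reconstruction of that script (/usr/local/sbin/debamax-upgrade): the echo messages match the journal excerpts quoted in this section, while the exact systemctl invocations and the sleep calls are assumptions.

```shell
#!/bin/sh
# Hedged reconstruction of the v2 upgrade script: messages match the
# journal excerpts; systemctl arguments and sleep calls are assumptions.
echo "starting debamax-upgrade"
echo "1. stopping some services during the upgrade"
systemctl stop demo-service-a demo-service-b
sleep 1
echo "2. simulating the upgrade: new package gets installed, daemon gets restarted"
systemctl restart debamax-daemon
sleep 1
echo "stopping debamax-upgrade"
```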

Note: At this stage, both versions of the debamax-upgrade.service unit would give the same results. The debamax-upgrade script could even be started manually from the shell (without involving systemctl at all). For simplicity's sake, the systemctl start debamax-upgrade call, that works with both versions, was chosen.

Actual testing

What happens when we start the upgrade process?

root@demo:~# systemctl start debamax-upgrade
----
Jun 13 17:28:47 demo systemd[1]: Starting debamax-upgrade.service...
Jun 13 17:28:47 demo debamax-upgrade[1272]: starting debamax-upgrade
Jun 13 17:28:47 demo debamax-upgrade[1272]: 1. stopping some services during the upgrade
Jun 13 17:28:47 demo demo-service-a[1263]: Terminated
Jun 13 17:28:47 demo demo-service-a[1263]: stopping service A
Jun 13 17:28:47 demo systemd[1]: Stopping demo-service-a.service...
Jun 13 17:28:47 demo systemd[1]: demo-service-a.service: Succeeded.
Jun 13 17:28:47 demo systemd[1]: Stopped demo-service-a.service.
Jun 13 17:28:47 demo demo-service-b[1262]: Terminated
Jun 13 17:28:47 demo demo-service-b[1262]: stopping service B
Jun 13 17:28:47 demo systemd[1]: Stopping demo-service-b.service...
Jun 13 17:28:47 demo systemd[1]: demo-service-b.service: Succeeded.
Jun 13 17:28:47 demo systemd[1]: Stopped demo-service-b.service.
Jun 13 17:28:48 demo debamax-upgrade[1272]: 2. simulating the upgrade: new package gets installed, daemon gets restarted
Jun 13 17:28:48 demo systemd[1]: Stopped target Graphical Interface.
Jun 13 17:28:48 demo systemd[1]: Stopping Graphical Interface.
Jun 13 17:28:48 demo systemd[1]: Stopped target Multi-User System.
Jun 13 17:28:48 demo systemd[1]: Stopping Multi-User System.
Jun 13 17:28:48 demo systemd[1]: Started demo-service-b.service.
Jun 13 17:28:48 demo debamax-daemon[1266]: Terminated
Jun 13 17:28:48 demo debamax-daemon[1266]: stopping debamax-daemon
Jun 13 17:28:48 demo systemd[1]: Stopping debamax-daemon.service...
Jun 13 17:28:48 demo systemd[1]: Started demo-service-a.service.
Jun 13 17:28:48 demo demo-service-b[1277]: starting service B
Jun 13 17:28:48 demo systemd[1]: debamax-daemon.service: Succeeded.
Jun 13 17:28:48 demo systemd[1]: Stopped debamax-daemon.service.
Jun 13 17:28:48 demo demo-service-a[1278]: starting service A
Jun 13 17:28:48 demo systemd[1]: Started debamax-daemon.service.
Jun 13 17:28:48 demo debamax-daemon[1280]: starting debamax-daemon
Jun 13 17:28:49 demo debamax-upgrade[1272]: stopping debamax-upgrade
Jun 13 17:28:49 demo systemd[1]: debamax-upgrade.service: Succeeded.
Jun 13 17:28:49 demo systemd[1]: Started debamax-upgrade.service.
Jun 13 17:28:49 demo systemd[1]: Reached target Multi-User System.
Jun 13 17:28:49 demo systemd[1]: Reached target Graphical Interface.
Jun 13 17:28:49 demo systemd[1]: Starting Update UTMP about System Runlevel Changes...
Jun 13 17:28:49 demo systemd[1]: systemd-update-utmp-runlevel.service: Succeeded.
Jun 13 17:28:49 demo systemd[1]: Started Update UTMP about System Runlevel Changes.

Basically, both demo services are stopped as expected at first, but when the debamax-daemon unit is restarted, the same target dance (as seen in the previous section) happens. As a side effect of exiting and re-entering targets, services that were purposefully stopped are started again!

Spotting this while testing the upgrade process was the final incentive to move from the early “let's use start instead of restart” workaround mentioned in the previous section… to a real fix, and that's the point where the differences between RequiredBy= and WantedBy= were analyzed.

Fixing metadata

Let's see what happens with proper metadata, remembering to first disable the service unit (to get the “wrong” symlink removed), before switching from RequiredBy= to WantedBy=, and enabling the service unit again:

root@demo:~# systemctl disable --now debamax-daemon
Removed /etc/systemd/system/multi-user.target.requires/debamax-daemon.service.

root@demo:~# sed 's/RequiredBy=/WantedBy=/' -i /etc/systemd/system/debamax-daemon.service

root@demo:~# systemctl enable --now debamax-daemon
Created symlink /etc/systemd/system/multi-user.target.wants/debamax-daemon.service \
  → /etc/systemd/system/debamax-daemon.service.
----
Jun 13 17:30:59 demo systemd[1]: Reloading.
Jun 13 17:30:59 demo debamax-daemon[1280]: Terminated
Jun 13 17:30:59 demo debamax-daemon[1280]: stopping debamax-daemon
Jun 13 17:30:59 demo systemd[1]: Stopping debamax-daemon.service...
Jun 13 17:30:59 demo systemd[1]: debamax-daemon.service: Succeeded.
Jun 13 17:30:59 demo systemd[1]: Stopped debamax-daemon.service.
…
Jun 13 17:32:02 demo systemd[1]: Reloading.
Jun 13 17:32:02 demo systemd[1]: Started debamax-daemon.service.
Jun 13 17:32:02 demo debamax-daemon[1317]: starting debamax-daemon

For the avoidance of doubt, here's the new version of the systemd service unit for the daemon (/etc/systemd/system/debamax-daemon.service):

[Service]
Type=simple
ExecStart=/usr/local/sbin/debamax-daemon

[Install]
WantedBy=multi-user.target

Let's run the upgrade scenario again:

root@demo:~# systemctl start debamax-upgrade
----
Jun 13 17:32:37 demo systemd[1]: Starting debamax-upgrade.service...
Jun 13 17:32:37 demo debamax-upgrade[1321]: starting debamax-upgrade
Jun 13 17:32:37 demo debamax-upgrade[1321]: 1. stopping some services during the upgrade
Jun 13 17:32:37 demo demo-service-a[1278]: Terminated
Jun 13 17:32:37 demo demo-service-a[1278]: stopping service A
Jun 13 17:32:37 demo systemd[1]: Stopping demo-service-a.service...
Jun 13 17:32:37 demo systemd[1]: demo-service-a.service: Succeeded.
Jun 13 17:32:37 demo systemd[1]: Stopped demo-service-a.service.
Jun 13 17:32:37 demo demo-service-b[1277]: Terminated
Jun 13 17:32:37 demo demo-service-b[1277]: stopping service B
Jun 13 17:32:37 demo systemd[1]: Stopping demo-service-b.service...
Jun 13 17:32:37 demo systemd[1]: demo-service-b.service: Succeeded.
Jun 13 17:32:37 demo systemd[1]: Stopped demo-service-b.service.
Jun 13 17:32:38 demo debamax-upgrade[1321]: 2. simulating the upgrade: new package gets installed, daemon gets restarted
Jun 13 17:32:38 demo debamax-daemon[1317]: Terminated
Jun 13 17:32:38 demo debamax-daemon[1317]: stopping debamax-daemon
Jun 13 17:32:38 demo systemd[1]: Stopping debamax-daemon.service...
Jun 13 17:32:38 demo systemd[1]: debamax-daemon.service: Succeeded.
Jun 13 17:32:38 demo systemd[1]: Stopped debamax-daemon.service.
Jun 13 17:32:38 demo systemd[1]: Started debamax-daemon.service.
Jun 13 17:32:38 demo debamax-daemon[1326]: starting debamax-daemon
Jun 13 17:32:39 demo debamax-upgrade[1321]: stopping debamax-upgrade
Jun 13 17:32:39 demo systemd[1]: debamax-upgrade.service: Succeeded.
Jun 13 17:32:39 demo systemd[1]: Started debamax-upgrade.service.

This time, the demo services are stopped as previously, but there's no target dance anymore, which means they are not started again. \o/

Conclusion

Our main takeaway would be: be extra careful before thinking about using the RequiredBy= keyword!

Of course, when in doubt, checking the documentation (systemd.unit) might have saved some trouble: reading the RequiredBy= and WantedBy= sections, plus their cross-references to Requires= and Wants=, would likely have led to the same outcome, even if it doesn't spell out all the differences between both approaches: “Often, it is a better choice to use Wants= instead of Requires= in order to achieve a system that is more robust when dealing with failing services.”


Published: Sun, 14 Jun 2020 17:00:00 +0200

Installing Jitsi behind a reverse proxy

Jitsi logo

Introduction

Update (April 2020): Since the first publication of this article, the “Jitsi configuration” section has been updated to reflect changes upstream. A “More about STUN servers” section has been added as well.

Videoconferencing with the official meet.jit.si instance has always been a pleasure, but it seemed a good idea to research how to install a local Jitsi instance, one that could be used by customers, by members of the local Linux Users Group (COAGUL), or by anyone else.

This instance is available at jitsi.debamax.com and should be considered a beta: it’s just been installed, and it’s still running the stock configuration. Feel free to tell us what works for you and what doesn’t!

Networking vs. virtualization host

One host was already set up as a virtualization environment, featuring libvirt, managing LXC containers and QEMU/KVM virtual machines. In this article, we focus on IPv4 networking. Basically, the TCP/80 and TCP/443 ports are exposed on the public IP, and NAT’d to one particular container, which acts as a reverse proxy. The Apache server running there defines as many VirtualHosts as there are services, acting as a reverse proxy for the appropriate LXC container or QEMU/KVM virtual machine.

Schematically: requests reaching the public IP on TCP/80 or TCP/443 are NAT’d to the reverse proxy container, which then forwards them to the appropriate guest.

What does that mean for the Jitsi installation? Well, Jitsi expects the following ports to be available: TCP/80, TCP/443, TCP/4443, and UDP/10000.

For this specific host, TCP/4443 and UDP/10000 were available, and have been NAT’d as well to the Jitsi virtual machine directly. Given the existing services, the same couldn’t be done for the TCP/443 port, which explains the need for the following section.


NAT and reverse proxy for Jitsi

Note: A summary of the host’s iptables configuration is available in the annex at the bottom of this article.

Apache as a reverse proxy

A new VirtualHost was defined on the apache2 service running as reverse proxy. The important parts are quoted below:

<VirtualHost *:80>
    ServerName jitsi.debamax.com
    RedirectMatch permanent ^(?!/\.well-known/acme-challenge/).* https://jitsi.debamax.com/
</VirtualHost>

<VirtualHost *:443>
    SSLProxyEngine on
    SSLProxyVerify none
    SSLProxyCheckPeerCN off
    SSLProxyCheckPeerName off
    SSLProxyCheckPeerExpire off

    ProxyPass        / https://192.168.122.120/
    ProxyPassReverse / https://192.168.122.120/
</VirtualHost>

The redirections set up on the TCP/80 port were already mentioned in the previous section, so let’s concentrate on the TCP/443 port part.

The ProxyPass and ProxyPassReverse directives act on /, meaning every path is proxied to the Jitsi virtual machine. If one weren’t using VirtualHost directives to distinguish between services, one could dedicate some specific paths (“subdirectories”) to Jitsi and proxy only those to the Jitsi instance. But let’s concentrate on the simpler “the whole VirtualHost is proxied” case.
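For illustration, the path-based variant might have looked like this (a hypothetical sketch, not used here; /jitsi/ is an arbitrary prefix):

```apache
ProxyPass        /jitsi/ https://192.168.122.120/
ProxyPassReverse /jitsi/ https://192.168.122.120/
```

(In practice, the proxied application must also support living under a sub-path for this to work.)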

The SSLProxyEngine on directive is needed for apache2 to accept proxying requests to a server using HTTPS, instead of plain HTTP.

All other SSLProxy* directives aren’t too nice as they disable all checks! Why do that, then? The answer is that Jitsi’s default installation is setting up an NGINX server with HTTP-to-HTTPS redirections, and it seemed easier to directly forward requests to the HTTPS port, disabling all checks since that NGINX server was installed with a self-signed certificate. One could deploy a suitable certificate there instead and enable the checks again, instead of using this “StackOverflow-style heavy hammer” (some directives might not even be needed).
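Should a suitable certificate be deployed on the Jitsi side, the checks could stay enabled; here's a hypothetical sketch, where the CA bundle path is made up, and where the backend certificate would also need to match the proxied host:

```apache
SSLProxyEngine on
# Trust the CA that signed the backend's certificate; all peer checks stay on.
SSLProxyCACertificateFile /etc/apache2/internal-ca.pem

ProxyPass        / https://192.168.122.120/
ProxyPassReverse / https://192.168.122.120/
```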

Jitsi configuration

Jitsi itself was installed on a QEMU/KVM virtual machine, running a basic Debian 10 (buster) system, initially provisioned with 2 CPUs, 4 GB RAM, 3 GB virtual disk. Its IP address is 192.168.122.120, which is what was configured as the target of the ProxyPass* directives in the previous section.

The installation was done using the quick-install.md documentation, entering jitsi.debamax.com as the FQDN, and opting for a self-signed certificate (leaving the reverse proxy in charge of the Let’s Encrypt certificate dance, like it does for all VirtualHosts).

Update (April 2020): Since late March 2020, upstream switched from videobridge to videobridge2. Another important change is that the jitsi-meet-turnserver package is pulled through jitsi-meet’s Recommends, as can be seen in APT metadata (wrapped for readability):

Depends:
 jitsi-videobridge2 (= 2.1-157-g389b69ff-1),
 jicofo (= 1.0-539-1),
 jitsi-meet-web (= 1.0.3928-1),
 jitsi-meet-web-config (= 1.0.3928-1),
 jitsi-meet-prosody (= 1.0.3928-1)
Recommends:
 jitsi-meet-turnserver (= 1.0.3928-1) | apache2

TURN servers make it possible for two clients to exchange streams in a peer-to-peer fashion, by finding a way to traverse NATs. In the setup documented here, the easiest option is not to install the jitsi-meet-turnserver package (as documented recently in quick-install.md).

Now, a very important point needs to be addressed (no pun intended). It isn’t so much related to running behind a reverse proxy as to the fact that the TCP/4443 and UDP/10000 ports are NAT’d: the videobridge component needs to know about that, and needs to know both the public IP and the local IP. In this context, the local IP is the Jitsi virtual machine’s local IP (the one the NAT for TCP/4443 and UDP/10000 points to), not the reverse proxy’s local IP. That’s why those lines have to be added to the /etc/jitsi/videobridge/sip-communicator.properties configuration file:

org.ice4j.ice.harvest.NAT_HARVESTER_LOCAL_ADDRESS=192.168.122.120
org.ice4j.ice.harvest.NAT_HARVESTER_PUBLIC_ADDRESS=163.172.19.80

[ Hint: Beware, there’s another sip-communicator.properties configuration file, for the jicofo component! ]

Additionally, a default setting needs to be commented out (in the same file), because the TURN server isn’t installed:

#org.ice4j.ice.harvest.STUN_MAPPING_HARVESTER_ADDRESSES=meet-jit-si-turnrelay.jitsi.net:443

Remember to restart the service:

systemctl restart jitsi-videobridge2

Update (April 2020): Until late March 2020, this systemd service unit used to be called jitsi-videobridge instead.

More about STUN servers

A privacy-conscious user was kind enough to inform a number of Jitsi instance administrators (including us) that the default Jitsi configuration uses Google’s STUN servers. This was fixed through a recent pull request: config: use Jitsi's STUN servers by default, instead of Google's.

Without waiting for a new upstream release, administrators can tweak their local configuration (in /etc/jitsi/meet/F.Q.D.N-config.js). This can be checked client-side by running tcpdump and verifying that packets are seen when a 2-participant conversation is set up:

tcpdump host meet-jit-si-turnrelay.jitsi.net
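As for the tweak itself, it might look like this in /etc/jitsi/meet/F.Q.D.N-config.js; this is a hypothetical sketch, the p2p.stunServers key and its exact shape being assumptions drawn from the pull request mentioned above, not from this instance:

```
p2p: {
    // Use Jitsi's STUN server instead of Google's:
    stunServers: [
        { urls: 'stun:meet-jit-si-turnrelay.jitsi.net:443' }
    ],
},
```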

For completeness: Jitsi’s own infrastructure relies on Amazon Web Services at the moment.

Annex: host networking configuration

The relevant iptables rules on the host are the following (leaving aside the usual MASQUERADING which is required when using NAT):

Chain FORWARD (filter table)
target     prot opt source               destination
ACCEPT     tcp  --  0.0.0.0/0            192.168.122.100      tcp dpt:80
ACCEPT     tcp  --  0.0.0.0/0            192.168.122.100      tcp dpt:443
ACCEPT     tcp  --  0.0.0.0/0            192.168.122.120      tcp dpt:4443
ACCEPT     udp  --  0.0.0.0/0            192.168.122.120      udp dpt:10000

Chain PREROUTING (nat table)
target     prot opt source               destination
DNAT       tcp  --  0.0.0.0/0            163.172.19.80        tcp dpt:80 to:192.168.122.100:80
DNAT       tcp  --  0.0.0.0/0            163.172.19.80        tcp dpt:443 to:192.168.122.100:443
DNAT       tcp  --  0.0.0.0/0            163.172.19.80        tcp dpt:4443 to:192.168.122.120:4443
DNAT       udp  --  0.0.0.0/0            163.172.19.80        udp dpt:10000 to:192.168.122.120:10000
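For reference, DNAT and FORWARD rules like the Jitsi-specific ones above might be created with commands along these lines (a hedged sketch: interface restrictions and rule ordering are left out):

```
iptables -t nat -A PREROUTING -d 163.172.19.80 -p tcp --dport 4443 \
  -j DNAT --to-destination 192.168.122.120:4443
iptables -t nat -A PREROUTING -d 163.172.19.80 -p udp --dport 10000 \
  -j DNAT --to-destination 192.168.122.120:10000
iptables -A FORWARD -d 192.168.122.120 -p tcp --dport 4443 -j ACCEPT
iptables -A FORWARD -d 192.168.122.120 -p udp --dport 10000 -j ACCEPT
```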

Published: Wed, 18 Mar 2020 10:15:00 +0100
Last modified: Thu, 02 Apr 2020 03:30:00 +0200

Fixing faulty synchronization in Nextcloud

Nextcloud logo

Introduction

Some problems were detected on a customer’s Nextcloud instance: disk space filled up all of a sudden, leading to service disruptions. Checking the Munin graphs, it seemed that disk usage gently increased from 80% to 85% over an hour, before spiking to 100% in a few extra minutes.

Looking at the “activity feed” (/apps/activity), there were a few minor edits by some users, but mostly many “SpecificUser has modified [lots of files]” occurrences. After allocating some extra space and making sure the service was behaving properly again, it was time to check with that SpecificUser whether those were intentional changes, or whether there might have been some mishap…

It turned out to be the unfortunate consequence of some disk maintenance operations that led to file system corruption. After some repair attempts, it seems the Nextcloud client triggered a synchronization that involved a lot of files, until it got interrupted because of the disk space issue on the server side. The question became: How many of the re-synchronized files might have been corrupted in the process? For example, /var/lib/dpkg/status had been replaced by a totally different file on the affected client.

Searching for a solution

Because of the possibly extensive corruption, one way to get back in time was to take note of all the “wanted” changes by other users, put them aside, restore from backups (from the previous night), and replay those changes. But then it was feared that any Nextcloud client having seen the new files could attempt to re-upload them, replacing the freshly-restored files.

That solution wasn’t very appealing, that’s why Cyril tried his luck on Twitter, asking whether there would be a way to revert all modifications from a given user during a given timeframe.

Feedback from the @Nextclouders account was received shortly afterwards, pointing out that such issues could have been caught client-side, with a warning displayed before replacing so many files; unfortunately that wasn’t the case here, and we were already in an after-the-fact situation.

The second lead could be promising if such an issue were to happen more than once. All the required information is in the database already, and there’s already a malware app that knows how to detect files that could have been encrypted by a cryptovirus, and which helps restore them by reverting to the previous version. It should be possible to create a new application implementing the missing feature by adjusting the existing malware app…

Diving into the actual details

At this stage, the swift reply from Nextcloud upstream seemed to indicate that early research didn’t miss any obvious solutions, so it was time to assess what happened on the file system, and see if that would be fixable without resorting to either restoring from backups or creating a new application…

It was decided to look at the last 10 hours, making sure to catch all files touched that day (be it before, during, or after the faulty synchronization):

    cd /srv/nextcloud
    find -type f -mmin -600 | sort > ~/changes

Searching for patterns in those several thousand files, one could spot these sets: upload leftovers under data/specificuser/uploads, preview files, and entries under files/ versus files_versions/.

Good news: Provided there are no more running Nextcloud clients trying to synchronize things with the server for that SpecificUser, all those files under data/specificuser/uploads could go away entirely, freeing up 10 GiB.

Next: preview files only amounted to 100 MiB, so spending more time on them didn’t seem worth it.

The remaining parts were of course the interesting ones: what about those files/ versus files_versions/ entries?

Versions in Nextcloud

Important note: The following is based on observations of this specific Nextcloud 16 instance, which wasn’t heavily customized; exercise caution, and use at your own risk!

Without checking either code or documentation, it seemed pretty obvious how things work: when a given foo/bar.baz file gets modified, Nextcloud keeps the previous copy, moving it from under files/ to under files_versions/ and adding a .vTIMESTAMP suffix, where TIMESTAMP is expressed in seconds since the epoch. Here’s an example:

./data/commonuser/files/foo/bar.baz
./data/commonuser/files_versions/foo/bar.baz.v1564577937

To convert from a given timestamp:

$ date -d '@1564577937'
Wed 31 Jul 14:58:57 CEST 2019

$ date -d '@1564577937' --rfc-2822
Wed, 31 Jul 2019 14:58:57 +0200

$ date -d '@1564577937' --rfc-3339=seconds
2019-07-31 14:58:57+02:00

Given there’s a direct mapping (same path, except under different directories) between an old version and its most recent file, this opened the way for a very simple check: “For each of those versions, does that version match the most recent file?”

If that version has the exact same content, one can assume that the Nextcloud client re-uploaded the exact same file (as a new version, though), and didn’t re-upload a corrupted file instead; which means that the old version can go away. If that version has a different content, it has to be kept around, and users notified so that they can check whether the most recent file is desired, or if a revert to a previous version would be better (Cyril acting as a system administrator here, rather than as an end-user).

Here’s a tiny shell script consuming the ~/changes file containing the list of recently-modified files (generated with find as detailed in the previous section), filtering and extracting each version (called $snapshot in the script for clarity), determining the path to its most recent file by dropping the suffix and adjusting the parent directory, and checking for identical contents with cmp:

#!/bin/sh
set -e

cd /srv/nextcloud
grep '/files_versions/' ~/changes | \
while read -r snapshot; do
  current=$(echo "$snapshot" | sed 's/\.v[0-9][0-9]*$//' | sed 's,/files_versions/,/files/,')
  if cmp -s "$snapshot" "$current"; then
    echo "I: match for $snapshot"
  else
    echo "E: no match for $snapshot"
  fi
done

At this point, it became obvious that most files had indeed been re-uploaded without getting corrupted, and it seemed sufficient to turn the first echo call into an rm one to get rid of their old, duplicate versions and regain an extra 2 GiB. Cases without a match resembled the list of files touched by other users, which was good news as well. To be on the safe side, that list was mailed to all involved users, so that they could check that the current files were the expected ones, possibly reverting to older versions where needed.
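That change might be sketched like this; it's the same loop as above with the matching branch turned into an rm, wrapped in a function taking the Nextcloud data root and the changes list as parameters for illustration (the original script was simply edited in place):

```shell
#!/bin/sh
# Hypothetical rm-variant of the script above: identical old versions are
# removed, mismatches are reported for manual checking.
prune_duplicate_versions() (
    cd "$1"
    grep '/files_versions/' "$2" | \
    while read -r snapshot; do
        current=$(echo "$snapshot" | sed 's/\.v[0-9][0-9]*$//' | sed 's,/files_versions/,/files/,')
        if cmp -s "$snapshot" "$current"; then
            rm -v "$snapshot"
        else
            echo "E: no match for $snapshot"
        fi
    done
)
```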

Conclusion

Fortunately, there was no need to develop an extra application implementing a new “let’s revert all changes from this user during that timeframe” feature to solve this specific case. Observation plus automation shrank the list of 2500+ modified files to just a handful that needed manual (user) checking. Some time was lost, but some space was reclaimed in the end. Not too bad for a Friday afternoon…

Many thanks to the Nextcloud team and community for a great piece of software, and for a very much appreciated swift reply!


Published: Sat, 29 Feb 2020 01:00:00 +0100