Fixing faulty synchronization in Nextcloud

29 February 2020

Introduction

Some problems were detected on a customer’s Nextcloud instance: disk space suddenly filled up, leading to service disruptions. According to the Munin graphs, disk usage rose gently from 80% to 85% over an hour, before spiking to 100% within a few more minutes.

Looking at the “activity feed” (/apps/activity), there were a few minor edits by some users, but mostly many “SpecificUser has modified [lots of files]” occurrences. After allocating some extra space and making sure the service was behaving properly again, it was time to check with that SpecificUser whether those were intentional changes, or whether there might have been some mishap…

It turned out to be the unfortunate consequence of some disk maintenance operations that led to file system corruption. After some repair attempts, it seems the Nextcloud client triggered a synchronization that involved a lot of files, until it got interrupted because of the disk space issue on the server side. The question became: How many of the re-synchronized files might have been corrupted in the process? For example, /var/lib/dpkg/status had been replaced by a totally different file on the affected client.

Searching for a solution

Because the corruption was potentially extensive, one way to go back in time was to take note of all the “wanted” changes made by other users, set them aside, restore from the previous night’s backups, and replay those changes. The fear was that any Nextcloud client that had already seen the new files might then attempt to re-upload them, overwriting the files that had just been restored.

That solution wasn’t very appealing, so Cyril tried his luck on Twitter, asking whether there was a way to revert all modifications made by a given user during a given timeframe.

Feedback from the @Nextclouders account was received shortly after that, pointing out that such issues could have been caught client-side and a warning might have been displayed before replacing so many files, but that wasn’t the case unfortunately, and we were already in an after-the-fact situation.

The second lead could prove promising if such an issue were to happen again. All the required information is already in the database, and there is already a malware app that knows how to detect files that could have been encrypted by a cryptovirus, and which helps restore them by reverting to the previous version. It should be possible to create a new application, implementing the missing feature by adjusting the existing malware app…

Diving into the actual details

At this stage, the swift reply from Nextcloud upstream seemed to indicate that early research didn’t miss any obvious solutions, so it was time to assess what happened on the file system, and see if that would be fixable without resorting to either restoring from backups or creating a new application…

It was decided to look at the last 10 hours, making sure to catch all files touched that day (be it before, during, or after the faulty synchronization):

cd /srv/nextcloud
find . -type f -mmin -600 | sort > ~/changes

Searching for patterns in those several thousand files, the following sets emerged:

data/commonuser/files/…
→ files that were indeed modified by various users; files belonging to, and shared by CommonUser.
data/commonuser/files_versions/…
→ versions of files that might or might not have been modified by the SpecificUser; files also belonging to, and shared by CommonUser.
data/appdata_…/preview/…/….png
→ previews/thumbnails likely generated on the fly when the activity stream was inspected.
data/specificuser/uploads/…/…
→ temporary files used during the synchronization, which were left around.
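
The grouping above can be reproduced with a quick tally of the list. This is only a minimal sketch: it assumes every path looks like ./data/OWNER/KIND/… (as printed by find from the Nextcloud root), and summarize_changes is a made-up helper name, not something Nextcloud provides.

```shell
#!/bin/sh
# Count changed files per "OWNER/KIND" pair, reading paths from stdin.
# Assumption: each path looks like ./data/OWNER/KIND/..., so that, once
# split on "/", field 3 is the owner and field 4 is the kind of data
# (files, files_versions, uploads, preview, ...).
summarize_changes() {
  awk -F/ '{ print $3 "/" $4 }' | LC_ALL=C sort | uniq -c
}

# Usage (the list comes from the find command shown earlier):
#   summarize_changes < ~/changes
```

Each output line is a count followed by an OWNER/KIND pair, which makes the few dominant sets stand out without scrolling through thousands of paths.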

Good news: Provided there are no more running Nextcloud clients trying to synchronize things with the server for that SpecificUser, all those files under data/specificuser/uploads could go away entirely, freeing up 10 GiB.

Next: the preview files only amounted to 100 MiB, so spending more time on them didn’t seem worthwhile.

The remaining parts were of course the interesting ones: what about those files/ versus files_versions/ entries?

Versions in Nextcloud

Important note: The following is based on observation on this specific Nextcloud 16 instance, which wasn’t heavily customized; exercise caution, and use at your own risk!

Without checking either code or documentation, it seemed pretty obvious how things work: when a given foo/bar.baz file gets modified, the previous copy is kept by Nextcloud, which moves it from under files/ to under files_versions/ and adds a suffix. The suffix is constructed this way: .vTIMESTAMP, where TIMESTAMP is expressed in seconds since epoch. Here’s an example:

./data/commonuser/files/foo/bar.baz
./data/commonuser/files_versions/foo/bar.baz.v1564577937

To convert from a given timestamp:

$ date -d '@1564577937'
Wed 31 Jul 14:58:57 CEST 2019

$ date -d '@1564577937' --rfc-2822
Wed, 31 Jul 2019 14:58:57 +0200

$ date -d '@1564577937' --rfc-3339=seconds
2019-07-31 14:58:57+02:00
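
When going through many versions, the timestamp can be pulled out of the file name rather than copied by hand. The following is a small sketch based on the .vTIMESTAMP naming observed above; version_timestamp is a hypothetical helper name, and the date invocation in the comment relies on GNU date, like the examples above.

```shell
#!/bin/sh
# Print the epoch timestamp embedded in a version file name, assuming the
# name ends with .vTIMESTAMP as observed on this instance. Nothing is
# printed if the name doesn't follow that pattern.
version_timestamp() {
  printf '%s\n' "$1" | sed -n 's/.*\.v\([0-9][0-9]*\)$/\1/p'
}

# Usage, feeding the result to date:
#   ts=$(version_timestamp ./data/commonuser/files_versions/foo/bar.baz.v1564577937)
#   date -d "@$ts" --rfc-3339=seconds
```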

Given there’s a direct mapping (same path, except under different directories) between an old version and its most recent file, this opened the way for a very simple check: “For each of those versions, does that version match the most recent file?”

If that version has the exact same content, one can assume that the Nextcloud client re-uploaded the exact same file (as a new version, though), and didn’t re-upload a corrupted file instead, which means that the old version can go away. If that version has different content, it has to be kept around, and users have to be notified so that they can check whether the most recent file is the desired one, or whether reverting to a previous version would be better (Cyril acting as a system administrator here, rather than as an end-user).

Here’s a tiny shell script consuming the ~/changes file containing the list of recently-modified files (generated with find as detailed in the previous section), filtering and extracting each version (called $snapshot in the script for clarity), determining the path to its most recent file by dropping the suffix and adjusting the parent directory, and checking for identical contents with cmp:

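#!/bin/sh
set -e

cd /srv/nextcloud
# Only old versions are of interest; skip every other entry in the list.
grep '/files_versions/' ~/changes | \
while IFS= read -r snapshot; do
  # Map the version to its most recent file: drop the .vTIMESTAMP suffix,
  # then switch from files_versions/ to files/.
  current=$(printf '%s\n' "$snapshot" | sed -e 's/\.v[0-9][0-9]*$//' -e 's,/files_versions/,/files/,')
  # cmp -s prints nothing and reports the result through its exit status;
  # a missing most recent file also ends up in the "no match" branch.
  if cmp -s "$snapshot" "$current"; then
    echo "I: match for $snapshot"
  else
    echo "E: no match for $snapshot"
  fi
done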

At this point, it became obvious that most files were indeed re-uploaded without getting corrupted, and it seemed sufficient to turn the first echo call into an rm one to get rid of their old, duplicate versions and regain an extra 2 GiB. Cases without a match seemed to resemble the list of files touched by other users, which seemed like good news as well. To be on the safe side, that list was mailed to all involved users, so that they could check that current files were the expected ones, possibly reverting to some older versions where needed.
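
Since swapping echo for rm is a one-way operation, a cautious variant could list the candidates first and only delete when asked explicitly. The sketch below is an assumption-laden illustration, not the script that was actually run: prune_identical_versions and its --delete argument are invented for this example, and the function takes the Nextcloud root and the list of changes as parameters instead of hardcoding them.

```shell
#!/bin/sh
# prune_identical_versions ROOT LIST [--delete]
# For each version listed in LIST (paths relative to ROOT), compare it with
# its most recent file. Identical versions are only printed by default, and
# removed when --delete is given. Differing versions are never touched.
prune_identical_versions() {
  root=$1
  list=$2
  mode=${3:-dry-run}
  grep '/files_versions/' "$list" | while IFS= read -r snapshot; do
    current=$(printf '%s\n' "$snapshot" |
      sed -e 's/\.v[0-9][0-9]*$//' -e 's,/files_versions/,/files/,')
    if cmp -s "$root/$snapshot" "$root/$current"; then
      if [ "$mode" = "--delete" ]; then
        rm -- "$root/$snapshot"
      else
        echo "would remove $snapshot"
      fi
    fi
  done
}

# Usage: review the dry-run output first, then delete.
#   prune_identical_versions /srv/nextcloud ~/changes
#   prune_identical_versions /srv/nextcloud ~/changes --delete
```

Keeping the dry run as the default means a typo in the arguments results in a listing rather than in lost versions.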

Conclusion

Fortunately, there was no need to develop an extra application to implement a new “let’s revert all changes from this user during that timeframe” feature to solve this specific case. Observation plus automation shrank the list of 2500+ modified files to just a handful that needed manual (user) checking. Some time lost, and some space that was reclaimed in the end. Not too bad for a Friday afternoon…

Many thanks to the Nextcloud team and community for a great piece of software, and for a very much appreciated swift reply!