Efficient live Hanoi backups

I use a tiny script to make backups that are

live
keeps an identical copy of the source so anyone can restore or review it
transparent
you can understand the whole system by reading this page
logarithmic
multiple backups with doubling age (1 day old, 2 days old, 4, 8, 16, etc)
efficient
avoid duplication of data across those backups

Backup arrangements should be simple. An offsite identical copy of a filesystem made with rsync is the only backup I trust, and thanks to the proliferation of cloud servers it's cheap and easy to set up.

While a daily backup is better than nothing, it's much nicer to have a rotating system of backups reaching back in time. One way of doing this is to just keep every daily backup, but this requires ever-increasing storage. It's much better to use an alternating pattern of backup slots, with slot 0 being used on odd days (1, 3, 5, ...), slot 1 used every fourth day (2, 6, 10, ...), slot 3 used every eighth day (4, 12, 20, ...) etc.

This is sometimes called Tower of Hanoi rotation, and it has the nice property that you have a selection of backups with approximately doubling ages. Put another way, there is always a backup that is 1 day old, 2 days old, 4 days old, 8 days old, etc.

How to compute the slot number is an interesting problem with a trivial solution. It's the number of trailing zeroes (or ones) in the binary representation of the day number.

The local day number can be found like this in Python:

>>> import datetime
>>> datetime.date.today().toordinal()
736720

plotting the next few days in binary and counting the trailing zeroes gives the desired pattern:

736720 10110011110111010000 4
736721 10110011110111010001 0
736722 10110011110111010010 1
736723 10110011110111010011 0
736724 10110011110111010100 2
736725 10110011110111010101 0
736726 10110011110111010110 1
736727 10110011110111010111 0

So the code to produce this number is:

def day_to_slot(n):
  return bin(n)[::-1].index('1')
print day_to_slot(datetime.date.today().toordinal())

One nice capability of rsync is making hard-links to files that already exist on the remote system. So if there is a file x that's unchanged from Monday to Tuesday, rsync actually keeps one copy of x, with hard-links from the Monday and Tuesday backups. This means that the multiple backup trees all look complete, but actually share data wherever possible.

Given slot numbers in $YESTERDAY and $TODAY, the backup "system" consists of invoking rsync with the appropriate options. The command to back up your home directory is:

rsync -az --link-dest=/data/backup/$YESTERDAY ~ remote:/data/backup/$TODAY

So the entire script looks like this:

#!/bin/bash -e

read -r YESTERDAY TODAY <<<$(
python -c '
import datetime
def day_to_slot(n):
  return bin(n)[::-1].index("1")
n = datetime.date.today().toordinal()
print(day_to_slot(n - 1))
print(day_to_slot(n))
')
rsync -avz --link-dest=/data/backup/$YESTERDAY ~ remote:/data/backup/$TODAY
ssh remote touch /data/backup/$TODAY

After running for a few days you'll have a remote backup directory that looks like this:

$ ls -ltr
total 0
drwxrwxr-x 1 jamesb jamesb 10 Jan 26 08:22 4
drwxrwxr-x 1 jamesb jamesb 10 Jan 28 08:23 1
drwxrwxr-x 1 jamesb jamesb 10 Jan 29 08:24 0
drwxrwxr-x 1 jamesb jamesb 10 Jan 30 08:22 2

Each of the above directories contains a full backup, but because of rsync's --link-dest option, the storage is shared between them. Slot 0 has a full copy, and the other slots contain only the differences:

$ du -sh *
34G     0
563M    1
537M    2
646M    4