Recently I was rethinking my off-site backup strategy. Ever since AWS launched their Glacier Deep Archive storage tier, it has looked really interesting for backup use-cases. For starters, it's super cheap to keep your bytes around: $1.01376/TiB/month in most regions as of this writing. I haven't found any other storage offering this cheap. Of course there is some fine print, but it matches my backup use-case:
- you are charged for at least 180 days of storage, even if you delete the objects before that
- you are charged for some overhead on top of the stored objects. Especially for smaller objects, this can add up significantly
But all in all, it still was very attractive.
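To get a feel for what the 180-day minimum means in practice, here is a small back-of-the-envelope sketch. It uses the price quoted above and assumes a 30-day billing month for simplicity; the function name `storage_cost` is made up for this illustration, and the per-object overhead and request costs are ignored.

```python
PRICE_PER_TIB_MONTH = 1.01376  # USD, the figure quoted above
MIN_BILLED_DAYS = 180          # minimum storage duration that is charged
DAYS_PER_MONTH = 30            # simplifying assumption for this estimate


def storage_cost(tib: float, days_kept: int) -> float:
    """Estimate the storage bill in USD for data kept `days_kept` days."""
    # Objects deleted early are still billed for the full minimum duration.
    billed_days = max(days_kept, MIN_BILLED_DAYS)
    return tib * PRICE_PER_TIB_MONTH * billed_days / DAYS_PER_MONTH


if __name__ == "__main__":
    # 1 TiB deleted after a month is still billed as 180 days.
    print(f"{storage_cost(1, 30):.2f}")   # 6.08
    print(f"{storage_cost(1, 180):.2f}")  # 6.08
```

In other words, data that churns within six months costs the same as data that sits there for six months, which is fine for a backup that mostly grows.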
Based on discussions I had, it seems I have a fairly beefy list of requirements:
- Bringing the backup up-to-date should be cheap. This means I don't want to update the full content every single time. I prefer to only upload what has changed (rsync-style).
- Keeping the backup around should be cheap.
- Restoring the backup may be expensive, either in time, money or both. Since this is not my first backup, I intend to never need it. So paying several hundred euro/dollar to get my most valuable files back is fine.
- Restoring should be "easy". Since I'll be using this backup in a disaster recovery mode, I prefer to be able to restore crucial files without the need for any particular software, just relying on standard tools.
- Backups need to be client-side encrypted, preferably with auditable/trusted tools, independent of the backup tool.
I looked around for existing projects and/or products, but didn't find one that checked all my boxes. Especially the "client-side encrypted" seemed like a high bar. So I decided to roll my own.
S3 natively supports versioning of objects. By default, you see the latest version of an object. Or, if the last version is a "delete marker", the object is hidden by default. But you can request previous versions explicitly if you need them. In addition, I use S3 Lifecycle management to remove old versions after a configured amount of time.
This setup ensures the "easy restore"-requirement: The S3 web console gives me access to the most recent backup content, but allows access to previous versions if needed.
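As a rough illustration of the versioning and lifecycle setup, the sketch below builds a lifecycle configuration that expires noncurrent versions and applies it with boto3. The rule ID, the 365-day retention and the function names are assumptions made for this example, not my actual configuration. boto3 is a third-party package and is only imported when the configuration is actually applied.

```python
def lifecycle_rules(noncurrent_days: int) -> dict:
    """Build a lifecycle configuration that removes old object versions."""
    return {
        "Rules": [
            {
                "ID": "expire-old-versions",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},     # empty prefix: whole bucket
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": noncurrent_days
                },
            }
        ]
    }


def apply_to_bucket(bucket: str, noncurrent_days: int = 365) -> None:
    """Enable versioning and install the lifecycle rules on `bucket`."""
    import boto3  # third-party; needs AWS credentials to be configured

    s3 = boto3.client("s3")
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=lifecycle_rules(noncurrent_days),
    )
```

Keeping the rule construction separate from the API calls makes it easy to inspect the configuration before touching a real bucket.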
The client-side encryption is a bit more challenging: the backup-tool needs to figure out if a particular file has changed since the previous backup or not. You can't just encrypt the file again, and compare that, since encrypting the same file twice is not guaranteed to give the same result. My solution is to add metadata to the S3-object: the size of the plaintext file, and a hash of the plaintext file. The size is included because it's very cheap to check: if a file's size has changed, the file has changed and needs uploading. The hash is more expensive to calculate, but will spot changes even when the file size is the same. I know there is an astronomically small chance that a changed file will not result in a changed hash, but I'm taking my chances.
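A minimal sketch of this change check could look as follows. The metadata key names and the choice of SHA-256 are assumptions for the example; any hash and any key names would work the same way. Values are kept as strings because S3 user metadata is string-valued.

```python
import hashlib
import os
from typing import Optional


def plaintext_fingerprint(path: str, chunk_size: int = 1 << 20) -> dict:
    """Return the metadata to attach to the encrypted S3 object."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "plaintext-size": str(os.path.getsize(path)),  # hypothetical key
        "plaintext-sha256": digest.hexdigest(),        # hypothetical key
    }


def needs_upload(path: str, remote_metadata: Optional[dict]) -> bool:
    """Decide whether the local file differs from what was backed up."""
    if remote_metadata is None:
        return True  # never backed up before
    # Cheap check first: a different size always means a changed file.
    if str(os.path.getsize(path)) != remote_metadata.get("plaintext-size"):
        return True
    # Same size: fall back to the more expensive hash comparison.
    local_hash = plaintext_fingerprint(path)["plaintext-sha256"]
    return local_hash != remote_metadata.get("plaintext-sha256")
```

The ordering matters for speed: the hash is only computed when the size check cannot settle the question.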
To make things more efficient, I also included a client-side cache of the S3 content. Listing the objects in an S3 bucket is fairly efficient, but getting the metadata requires a HEAD call per object. At 71 seconds per 1000 calls, which is what I get in practice, this takes way too long for a 200k-object backup: roughly four hours. Not to mention the monetary cost of making these calls every single time. But since this data is cached locally, care should be taken to never modify the bucket directly. This would make the cache incorrect, and may result in bad backups.
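To make the numbers and the idea concrete, here is a small sketch of such a cache. The JSON file format, the class name `MetadataCache` and its methods are invented for this example; the only figure taken from above is the 71 seconds per 1000 HEAD calls.

```python
import json
from pathlib import Path
from typing import Optional

SECONDS_PER_1000_HEAD_CALLS = 71  # measured figure quoted above


def head_scan_hours(object_count: int) -> float:
    """Estimate how long one HEAD call per object would take, in hours."""
    return object_count / 1000 * SECONDS_PER_1000_HEAD_CALLS / 3600


class MetadataCache:
    """Local copy of per-object metadata, keyed by S3 object key."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        if self.path.exists():
            self.entries = json.loads(self.path.read_text())
        else:
            self.entries = {}

    def get(self, key: str) -> Optional[dict]:
        return self.entries.get(key)

    def record_upload(self, key: str, size: int, sha256: str) -> None:
        # Must be called for every upload, or the cache drifts from S3.
        self.entries[key] = {"size": size, "sha256": sha256}

    def record_delete(self, key: str) -> None:
        self.entries.pop(key, None)

    def save(self) -> None:
        self.path.write_text(json.dumps(self.entries, sort_keys=True))
```

For 200,000 objects, `head_scan_hours` gives just under four hours, which is the cost the cache avoids on every run. The flip side is visible in the code: the cache only stays correct if every change to the bucket goes through `record_upload` or `record_delete`.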
One of the things I want to back up is my git repositories. Git has a habit of creating lots of small files. And while the above design supports small files, it gets slow and expensive: uploading ten 1-byte files takes way longer than uploading a single 10-byte file. And since there is an additional 40 KB of overhead per object (some of it billed at standard S3 pricing), a single-byte file is (relatively speaking) expensive.
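The effect of the per-object overhead is easy to see with a bit of arithmetic. The sketch below assumes the overhead is 40 × 1024 bytes and treats all of it as billed storage, ignoring that part of it is charged at a different rate; the function name is made up for the example.

```python
OVERHEAD_BYTES = 40 * 1024  # per-object overhead; binary KB is an assumption


def billed_bytes(object_sizes: list[int]) -> int:
    """Total bytes billed when each size is stored as a separate object."""
    return sum(size + OVERHEAD_BYTES for size in object_sizes)


if __name__ == "__main__":
    ten_small = billed_bytes([1] * 10)  # ten 1-byte objects
    one_larger = billed_bytes([10])     # one 10-byte object
    print(ten_small, one_larger)        # 409610 40970
```

The same ten bytes of payload are billed as roughly ten times more storage when split over ten objects, which is the motivation for bundling small files.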
So I wanted to ZIP together smaller files and upload the ZIP instead. This is where it gets tricky: you want both very small ZIPs and very large ZIPs to cover contradicting needs. On one hand, you want the ZIPs to be as large as possible, to maximize the efficiency-gains. On the other hand, you want the ZIPs to be as small as possible, since a change to a single file in the ZIP requires the upload of the complete new ZIP.
From a restore point of view, you also want the ZIPs to be "logical". Given a filename and a point in time, I want to be able to cherry-pick that file from the backup with as little overhead as possible. I want to either find the file itself in the backup, or be able to tell fairly easily which ZIP contains it.
I went through several variants of grouping-logic. My current implementation takes a configurable threshold as input. Files that are larger than the threshold are stored to S3 directly. Smaller files are grouped based on their filenames, where the algorithm tries to find the longest prefix that will result in a ZIP of at least the required size. So given the following files and a threshold of 1 MiB:
large-file    5 MiB
.git/
    a         1 KiB
    b         1 KiB
    c         1 KiB
    d         4 MiB
It will pass through large-file and .git/d, but will ZIP the other files in the .git/ folder together into a single ZIP.
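The grouping idea can be sketched roughly as follows. This is a simplified illustration, not my actual implementation: it only considers prefixes that end at a directory boundary, it measures the uncompressed size of the members rather than the resulting ZIP, and it puts whatever is left at the top level into one final group. The function name `plan_groups` and the sample paths in the usage note are made up.

```python
import posixpath
from collections import defaultdict


def plan_groups(files: dict[str, int], threshold: int):
    """Split files into direct uploads and ZIP groups.

    `files` maps a relative path to its size in bytes. Returns
    (direct, groups): `direct` lists paths uploaded as-is, `groups`
    maps a directory prefix to the paths zipped together under it.
    """
    direct = sorted(p for p, size in files.items() if size > threshold)

    # Small files start out pending in their own directory.
    pending = defaultdict(list)
    for path, size in files.items():
        if size <= threshold:
            pending[posixpath.dirname(path)].append(path)

    def depth(directory: str) -> int:
        return directory.count("/") + 1 if directory else 0

    groups = {}
    while pending:
        # Deepest directory first, so each ZIP gets the longest prefix.
        directory = max(pending, key=lambda d: (depth(d), d))
        members = pending.pop(directory)
        total = sum(files[p] for p in members)
        if directory == "" or total >= threshold:
            prefix = directory + "/" if directory else ""
            groups[prefix] = sorted(members)
        else:
            # Too small on its own: hand the files to the parent directory.
            pending[posixpath.dirname(directory)].extend(members)
    return direct, groups
```

With a threshold of 1 MiB, three 400 KiB files in repo/.git/ are zipped under the prefix repo/.git/, while two 1 KiB files in notes/ are too small to justify their own ZIP and end up in the top-level group.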
To find a file for restore, I know that the file will either be directly visible on S3, or be in the first ZIP file I encounter by removing characters from the end of the filename I'm looking for.
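The lookup described here can be sketched in a few lines. To avoid guessing at how the ZIP objects are named, the example takes the set of prefixes that have a ZIP as a separate input; that split, and the function name `locate`, are assumptions made for the illustration.

```python
from typing import Optional


def locate(filename: str, object_keys: set[str],
           zip_prefixes: set[str]) -> Optional[tuple[str, str]]:
    """Find where `filename` lives in the backup.

    Returns ("object", key) if the file was uploaded directly,
    ("zip", prefix) for the first ZIP prefix found while shortening
    the filename, or None if nothing matches.
    """
    if filename in object_keys:
        return ("object", filename)
    # Remove one character at a time from the end of the filename,
    # down to the empty prefix.
    for end in range(len(filename) - 1, -1, -1):
        prefix = filename[:end]
        if prefix in zip_prefixes:
            return ("zip", prefix)
    return None
```

Because the search shortens the name step by step, the first hit is always the longest matching prefix, so the result is a single candidate ZIP to download rather than several to scan.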