How to think about backups: gzip.cloud design choices and rationales
Backups are boring until they are the only thing between you and a week of recovery.
This is day zero: no launch, no feature parade, just the thinking that should exist before day one of operations.
Backups exist for one reason: reproducibility, at the lowest cost and as fast as possible. If you lose data and cannot reproduce it, you will pay in time, money, or both.
gzip.cloud is the first brick in a larger idea: an air-gapped, cloud-like service you can run on-prem, operated by small teams who do not need to play the scalability game. The goal is manageable over scalable. A company of 1 to 500 people should be able to run this without a dedicated SRE team.
I also know I have a tendency to over-complicate systems. This project is an exercise in doing the opposite. gzip.cloud is not up yet; I am building it in public and documenting the design choices as I go.
Manageable beats scalable
Every artifact you ship is an asset and a liability. It helps you do more, and it must be maintained, patched, and understood. Manageable means maintenance cost is ridiculously low compared to the value it provides.
That is why I drive a Toyota. Not because it is exciting, but because it is boring in the best way.
That is the design bar here.
The core promise: backups made simple
gzip.cloud is not sophisticated. It is simple by design:
- Few failure modes
- Boring by default
- Fully open source (source will live at advanced-stack/gzip.local)
- Usable by a single human
This service will exist because backups are the one area where clever is usually expensive.
The threat model I actually care about
These are the failures I have seen or lived through:
- The medium died
- The backup cron failed silently
- The credit card expired and backups were purged
- A bucket was deleted by accident
- Ransomware encrypted the source
- A leaked access key exposed data
- The worst case: no backups at all
Two simple examples of the last one:
- A friend asked Claude Cowork to reorganize his macOS desktop. It decided a folder named "Private" was too messy and ran rm -rf on it. No backup.
- A colleague went to delete node_modules, hit an unlucky space, and deleted his entire development folder. No backup.
Everything I build must block these. If it does not, it is likely a distraction.
Design choice 1: a universal storage brick (SFTP)
I changed my mind on restic as the front door. The client should be free to use any backup tool they already trust: restic, rsync, Déjà Dup, manual scp, even a custom script.
SFTP is plain, universal, and secure. It means the service is a storage brick, not a toolchain prison. If you can push files securely, you can use gzip.cloud.
That is the point.
Design choice 2: immutable backups + built-in expiry
Backups should not be overwritten by a user. They should only expire. This removes an entire class of mistakes and attacks:
- "I wiped yesterday's good backup."
- "My script overwrote everything."
- "A compromised key deleted the archive."
Immutable storage and expiry are not optional. They are how you keep the model honest.
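The model above can be sketched as a toy write-once store. This is an illustration of the principle, not gzip.cloud's implementation; the class name, retention parameter, and error choice are all made up for the sketch:

```python
import time

class WriteOnceStore:
    """Toy append-only backup target: uploads can never overwrite an
    existing object, and objects only disappear through expiry."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self._objects = {}  # name -> (created_at, data)

    def put(self, name, data, now=None):
        now = time.time() if now is None else now
        if name in self._objects:
            # Immutability: a second write with the same name is refused,
            # whether it comes from a buggy script or a compromised key.
            raise PermissionError(f"{name} already exists; backups are immutable")
        self._objects[name] = (now, data)

    def expire(self, now=None):
        """The only path by which data leaves the store: built-in expiry."""
        now = time.time() if now is None else now
        expired = [name for name, (created, _) in self._objects.items()
                   if now - created > self.retention]
        for name in expired:
            del self._objects[name]
        return expired
```

With this shape, "my script overwrote everything" is impossible by construction, and a stolen key can add noise but cannot destroy history.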
The quiet failure: losing the keys
There is a failure mode people do not talk about: you did the backups, but you lost the passphrase. It is the most mundane disaster, and it is total. If you cannot decrypt, you do not have backups.
That is why the root of trust is not optional. In practice there are two trust models:
- Secret-based: "I know a password."
- Identity-based: "I am this person." (ID card, face, fingerprints)
Both have failure modes. Secrets get lost. Identity systems get locked, spoofed, or unavailable. A backup system has to decide which one it actually depends on, and how recovery works when a human forgets.
I will cover this in the next article, but the design principle stays the same: plain, reliable, and recoverable, not clever.
Design choice 3: external reminders, not silent cron
Most backup failures are not technical. They are human. A cron job that fails quietly is worse than no cron job at all.
So the service should ask the user: "We did not receive anything this week/day/hour. Is that expected?" And it should escalate through whatever channel the user allows:
- SMS
This is exactly where an AI agent is useful. Not for touching data, but for hunting silence and turning it into a decision.
I want a shared schedule model, closer to readiness or interval probes. It is explicit about expected frequency, and silence is detectable. The escalation details are still open, but the principle is not: backups do not get to fail quietly.
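As a sketch of that interval-probe idea, assuming each client declares one expected frequency and allowing a grace factor before escalating (both the function name and the grace value are illustrative):

```python
from datetime import datetime, timedelta

def should_escalate(last_received, expected_interval, now, grace=1.5):
    """Readiness-probe-style check: given when the last backup actually
    arrived and how often backups are expected, decide whether to ask
    the user 'we did not receive anything -- is that expected?'."""
    deadline = last_received + expected_interval * grace
    return now > deadline  # True -> silence detected, start escalating
```

The point is that silence is detectable because the expected frequency is explicit, not buried in a cron job on the client's machine.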
What certainty means
I am not optimizing for headline features. I am optimizing for certainty:
- Backups exist.
- We can restore in time to survive.
- Backups are secure from the obvious threats.
That is it. Anything else is a distraction.
1) Backups exist
Backups that never ran are a lie.
So the system must check for actual incoming data, not just scheduled tasks.
2) Restore is feasible
Time to recover is the metric that matters.
This is where I refuse to be cheap. Redundancy and bandwidth matter more than raw storage cost. A 10 TB backup that takes weeks to restore is a slow-motion failure.
Glacier-style storage is fine as a secondary fallback, not as the primary.
Time to recover should scale with data size, not surprise you. My current mental model is simple: if you store around 100 GB, you should be able to pull at 1 Gbps. If you store around 10 TB, you should be on 10 Gbps. That means pricing needs to be tied to recovery bandwidth, not just storage volume.
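The arithmetic behind that mental model, ignoring protocol overhead, disk speed, and incremental restores:

```python
def restore_hours(size_bytes, bandwidth_bps):
    """Worst-case wall-clock time to pull a full backup at a
    sustained link rate (bytes -> bits, then divide by line rate)."""
    return size_bytes * 8 / bandwidth_bps / 3600

# 100 GB at 1 Gbps  -> about 13 minutes
# 10 TB  at 10 Gbps -> a bit over 2 hours
# 10 TB  at 1 Gbps  -> roughly a full day: the "slow-motion failure"
```

The same data at the wrong bandwidth tier turns a lunch-break restore into a multi-day outage, which is why recovery bandwidth belongs in the pricing model.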
3) Secure against deletion and leakage
Immutable backups, separate keys, and minimal permissions. The service never needs to see clear data.
It should behave like a dumb storage target with strong guardrails.
Retention is defined by schedule, not just days. "Last N backups" means very different things if you run hourly versus weekly. The shortest retention I am willing to allow is 3 months. Past that, it is about keeping a sane spread of time, and knowing whether a restore is full or incremental.
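A tiny illustration of why "last N backups" has to be read against the schedule rather than as a number of days:

```python
from datetime import timedelta

def coverage(n_backups, interval):
    """Wall-clock history that 'keep the last N backups' actually covers."""
    return n_backups * interval

# "Keep the last 90 backups" means:
hourly = coverage(90, timedelta(hours=1))  # under four days of history
weekly = coverage(90, timedelta(weeks=1))  # well over a year of history
```

The same retention count spans less than four days on an hourly schedule and more than twenty months on a weekly one, so retention policy only makes sense when it is expressed relative to the schedule.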
The human part I will not automate
Client onboarding is human.
A 15-minute call is worth it. You want a face. You want to know who to call if you see the service go silent. I want the right to call you if backups stop arriving.
This is not a faceless tool. It is a trust relationship.
Day zero is not day one
Backups are only the first brick.
The next article will be about the root of trust: encryption keys, secrets, and the plain, efficient way to manage them with minimal operations.
Same design principles. Same refusal of sophistication.
