How to think about backups: gzip.cloud design choices and rationales
Backups are boring until they are the only thing between you and a week of recovery.
This is day zero: no launch, no feature parade, just the thinking that should exist before day one of operations.
Backups exist for one reason: reproducibility, at the lowest cost and as fast as possible. If you lose data and cannot reproduce it, you will pay in time, money, or both.
gzip.cloud is the first brick in a larger idea: an air-gapped, cloud-like service you can run on-prem, operated by small teams who do not need to play the scalability game. The goal is manageable over scalable. A company of 1 to 500 people should be able to run this without a dedicated SRE team.
I also know I have a tendency to over-complicate systems. This project is an exercise in doing the opposite. gzip.cloud is not up yet; I am building it in public and documenting the design choices as I go.
Manageable beats scalable
Every artifact you ship is an asset and a liability. It helps you do more, and it must be maintained, patched, and understood. Manageable means maintenance cost is ridiculously low compared to the value it provides.
That is why I drive a Toyota. Not because it is exciting, but because it is boring in the best way.
That is the design bar here.
The core promise: backups made simple
gzip.cloud is not sophisticated. It is simple by design:
- Few failure modes
- Boring by default
- Fully open source (source will live at advanced-stack/gzip.local)
- Usable by a single human
This service will exist because backups are the one area where clever is usually expensive.
The threat model I actually care about
These are the failures I have seen or lived through:
- The medium died
- The backup cron failed silently
- The credit card expired and backups were purged
- A bucket was deleted by accident
- Ransomware encrypted the source
- A leaked access key exposed data
- The worst case: no backups at all
Two simple examples of the last one:
- A friend asked Claude Cowork to reorganize his macOS desktop. It decided a folder named "Private" was too messy and ran rm -rf on it. No backup.
- A colleague went to delete node_modules, hit an unlucky space, and deleted his entire development folder. No backup.
Everything I build must block these. If it does not, it is likely a distraction.
Design choice 1: a universal storage brick (SFTP)
I changed my mind on restic as the front door. The client should be free to use any backup tool they already trust: restic, rsync, Déjà Dup, manual scp, even a custom script.
SFTP is plain, universal, and secure. It means the service is a storage brick, not a toolchain prison. If you can push files securely, you can use gzip.cloud.
That is the point.
Design choice 2: immutable backups + built-in expiry
Backups should not be overwritten by a user. They should only expire. This removes an entire class of mistakes and attacks:
- "I wiped yesterday's good backup."
- "My script overwrote everything."
- "A compromised key deleted the archive."
Immutable storage and expiry are not optional. They are how you keep the model honest.
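The model above can be sketched as a toy write-once store. This is an illustration of the principle, not gzip.cloud's implementation; the class name, retention parameter, and error choice are all made up for the sketch:

```python
import time

class WriteOnceStore:
    """Toy append-only backup target: uploads can never overwrite an
    existing object, and objects only disappear through expiry."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self._objects = {}  # name -> (created_at, data)

    def put(self, name, data, now=None):
        now = time.time() if now is None else now
        if name in self._objects:
            # Immutability: a second write with the same name is refused,
            # whether it comes from a buggy script or a compromised key.
            raise PermissionError(f"{name} already exists; backups are immutable")
        self._objects[name] = (now, data)

    def expire(self, now=None):
        """The only path by which data leaves the store: built-in expiry."""
        now = time.time() if now is None else now
        expired = [name for name, (created, _) in self._objects.items()
                   if now - created > self.retention]
        for name in expired:
            del self._objects[name]
        return expired
```

With this shape, "my script overwrote everything" is impossible by construction, and a stolen key can add noise but cannot destroy history.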
The quiet failure: losing the keys
There is a failure mode people do not talk about: you did the backups, but you lost the passphrase. It is the most mundane disaster, and it is total. If you cannot decrypt, you do not have backups.
That is why the root of trust is not optional. In practice there are two trust models:
- Secret-based: "I know a password."
- Identity-based: "I am this person." (ID card, face, fingerprints)
Both have failure modes. Secrets get lost. Identity systems get locked, spoofed, or unavailable. A backup system has to decide which one it actually depends on, and how recovery works when a human forgets.
I will cover this in the next article, but the design principle stays the same: plain, reliable, and recoverable, not clever.
Design choice 3: external reminders, not silent cron
Most backup failures are not technical. They are human. A cron job that fails quietly is worse than no cron job at all.
So the service should ask the user: "We did not receive anything this week/day/hour. Is that expected?" And it should escalate through whatever channel the user allows:
- SMS
This is exactly where an AI agent is useful. Not for touching data, but for hunting silence and turning it into a decision.
I want a shared schedule model, closer to readiness or interval probes. It is explicit about expected frequency, and silence is detectable. The escalation details are still open, but the principle is not: backups do not get to fail quietly.
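As a sketch of that interval-probe idea, assuming each client declares one expected frequency and allowing a grace factor before escalating (both the function name and the grace value are illustrative):

```python
from datetime import datetime, timedelta

def should_escalate(last_received, expected_interval, now, grace=1.5):
    """Readiness-probe-style check: given when the last backup actually
    arrived and how often backups are expected, decide whether to ask
    the user 'we did not receive anything -- is that expected?'."""
    deadline = last_received + expected_interval * grace
    return now > deadline  # True -> silence detected, start escalating
```

The point is that silence is detectable because the expected frequency is explicit, not buried in a cron job on the client's machine.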
What certainty means
I am not optimizing for headline features. I am optimizing for certainty:
- Backups exist.
- We can restore in time to survive.
- Backups are secure from the obvious threats.
That is it. Anything else is a distraction.
1) Backups exist
Backups that never ran are a lie.
So the system must check for actual incoming data, not just scheduled tasks.
2) Restore is feasible
Time to recover is the metric that matters.
This is where I refuse to be cheap. Redundancy and bandwidth matter more than raw storage cost. A 10 TB backup that takes weeks to restore is a slow-motion failure.
Glacier-style storage is fine as a secondary fallback, not as the primary.
Time to recover should scale with data size, not surprise you. My current mental model is simple: if you store around 100 GB, you should be able to pull at 1 Gbps. If you store around 10 TB, you should be on 10 Gbps. That means pricing needs to be tied to recovery bandwidth, not just storage volume.
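The arithmetic behind that mental model, ignoring protocol overhead, disk speed, and incremental restores:

```python
def restore_hours(size_bytes, bandwidth_bps):
    """Worst-case wall-clock time to pull a full backup at a
    sustained link rate (bytes -> bits, then divide by line rate)."""
    return size_bytes * 8 / bandwidth_bps / 3600

# 100 GB at 1 Gbps  -> about 13 minutes
# 10 TB  at 10 Gbps -> a bit over 2 hours
# 10 TB  at 1 Gbps  -> roughly a full day: the "slow-motion failure"
```

The same data at the wrong bandwidth tier turns a lunch-break restore into a multi-day outage, which is why recovery bandwidth belongs in the pricing model.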
3) Secure against deletion and leakage
Immutable backups, separate keys, and minimal permissions. The service never needs to see clear data.
It should behave like a dumb storage target with strong guardrails.
Retention is defined by schedule, not just days. "Last N backups" means very different things if you run hourly versus weekly. The shortest retention I am willing to allow is 3 months. Past that, it is about keeping a sane spread of time, and knowing whether a restore is full or incremental.
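A tiny illustration of why "last N backups" has to be read against the schedule rather than as a number of days:

```python
from datetime import timedelta

def coverage(n_backups, interval):
    """Wall-clock history that 'keep the last N backups' actually covers."""
    return n_backups * interval

# "Keep the last 90 backups" means:
hourly = coverage(90, timedelta(hours=1))  # under four days of history
weekly = coverage(90, timedelta(weeks=1))  # well over a year of history
```

The same retention count spans less than four days on an hourly schedule and more than twenty months on a weekly one, so retention policy only makes sense when it is expressed relative to the schedule.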
The human part I will not automate
Client onboarding is human.
A 15-minute call is worth it. You want a face. You want to know who to call if you see the service go silent. I want the right to call you if backups stop arriving.
This is not a faceless tool. It is a trust relationship.
Day zero is not day one
Backups are only the first brick.
The next article will be about the root of trust: encryption keys, secrets, and the plain, efficient way to manage them with minimal operations.
Same design principles. Same refusal of sophistication.
