The G.E.A.R. experiment: 47 hours of running an autonomous AI agent, what broke, and what I learned
You have probably seen the demos by now. OpenClaw, Hermes, and the growing list of personal AI agents that promise to run your life from the background.
The demos are clean.
This is what happens when you actually run one for 47 hours without stopping.
Over that period, I built an autonomous AI agent, G.E.A.R. (Global Execution & Automation Routine), on a Debian server. It receives instructions via email, executes system tasks, maintains itself, runs automated model evaluations, and replies. All communication happened through a local mail system. No dashboard. No API calls from my laptop. Just email.
64 messages exchanged. 24 sessions tracked. 12 maintenance scripts. 43 knowledge notes. 15 model evaluation runs.
And a lot of broken things.
This is not a success story. It is a field report from behind the curtain.
The setup
The machine was a modest Debian 13 box: Intel i7, 15 GB RAM, 17 GB root disk. Connected via Tailscale. Running Postfix and Dovecot for local mail. OpenCode as the AI execution engine.
The architecture was simple:
- A cron job runs every minute, checking for new emails
- When mail arrives, it spawns a background runner
- The runner launches an OpenCode session with the email as input
- The agent performs whatever task was requested
- The response gets sent back as a reply email
- The runner cleans up after itself
Non-blocking. Each email is independent. Multiple long-running tasks can coexist.
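The polling step can be sketched in a few lines. This is a minimal illustration, not the script G.E.A.R. actually ran: it assumes mail is delivered as one file per message into a Maildir-style new/ directory, and the processing/ directory and the spawn callback are hypothetical names introduced here.

```python
import os
from pathlib import Path
from typing import Callable, List


def poll_once(maildir: Path, spawn: Callable[[Path], None]) -> List[Path]:
    """Claim every message in maildir/new and hand each one to its own runner."""
    new_dir = maildir / "new"
    work_dir = maildir / "processing"  # hypothetical holding area for claimed mail
    work_dir.mkdir(parents=True, exist_ok=True)

    claimed = []
    for msg in sorted(new_dir.glob("*")):
        target = work_dir / msg.name
        try:
            # rename is atomic on one filesystem, so only one poller wins a message
            os.rename(msg, target)
        except FileNotFoundError:
            continue  # an overlapping cron invocation claimed it first
        spawn(target)  # expected to return immediately (background runner)
        claimed.append(target)
    return claimed
```

In a real deployment spawn would start a detached process, for example via subprocess.Popen with start_new_session=True, so the cron invocation returns right away and long tasks keep running. Claiming by rename is what lets a once-a-minute cron job overlap with itself without handling the same email twice.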
The constraint I gave myself: I only communicate via email. If the agent can't handle it through that interface, it doesn't work.
One design choice worth calling out: one email thread = one OpenCode session. Hermes has an email channel but doesn't manage parallel threads — everything collapses into a single context. That works for linear conversations but falls apart when you have multiple ongoing topics. My mapping was simpler: each email thread gets its own session, tracked in sessions.json, with subject normalization for thread continuation. It felt natural. When I replied to a maintenance thread, the agent picked up where it left off. When I started a new topic, it got a fresh session.
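A rough sketch of the thread-to-session mapping, under stated assumptions: the layout of sessions.json (a flat object from normalized subject to session id), the prefix list, and the use of a random hex id are all illustrative choices, not the format G.E.A.R. used.

```python
import json
import re
import uuid
from pathlib import Path

# Leading reply/forward markers such as "Re:", "RE:", "Fwd:", "Fw:", "AW:"
_PREFIX = re.compile(r"^\s*((re|fwd?|aw)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    """Strip reply prefixes, collapse whitespace, and lowercase the subject."""
    stripped = _PREFIX.sub("", subject)
    return " ".join(stripped.split()).lower()


def session_for(subject: str, store: Path) -> str:
    """Return the session id for this thread, creating one for a new topic."""
    sessions = json.loads(store.read_text()) if store.exists() else {}
    key = normalize_subject(subject)
    if key not in sessions:
        sessions[key] = uuid.uuid4().hex  # hypothetical id scheme
        store.write_text(json.dumps(sessions, indent=2))
    return sessions[key]
```

Keying on the subject alone is the simplest thing that works, but two unrelated threads with the same subject would share a session. Matching on the In-Reply-To or References headers would be stricter, at the cost of more parsing.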
The model behind the agent was DeepSeek V4 via OpenCode. The model being evaluated was Qwen 3.5 9B. Two different models, two different roles.
What I asked it to do
The requests were not theoretical. They were things I would actually need from an autonomous system:
- Set up a self-maintenance loop (daily health checks, weekly updates, monthly audits)
- Troubleshoot failures in its own infrastructure
- Build a memory system using zettelkasten methodology
- Run an automated evaluation suite on a local AI model (Qwen 3.5 9B)
- Write a self-replication guide so the whole system could be reproduced on a new machine
- Monitor itself and report anomalies
I treated it like an employee I could email. Not a chatbot. Not a tool I invoke. A system that runs in the background and responds when I write to it.
What worked
The email interface actually works. After the initial setup friction, the pattern of "send email → get reply" became natural. Thread continuation worked. The agent could reference previous conversations. It felt like working with someone who reads mail, not like prompting a model.
Self-maintenance is viable. The agent built 12 maintenance scripts covering daily health checks, weekly package updates, monthly deep audits, and a 107-test self-assessment with a quantified health score. All reports consolidated into a single daily email. I got one email per morning telling me the state of the system. That is the bar.
The SSH safety system never failed after the initial fix. Early on, a firewall reset locked me out. The agent built a multi-layer safety system: emergency iptables pre-rule, a cron script checking SSH every 60 seconds, documented safe procedures. Zero outages after that.
Prompt engineering matters more than model choice, for local models. Increasing max_tokens from 512 to 8192 and adding "final answer on last line" to the prompt transformed Qwen 3.5 9B's scores from 20-40% to 100% on math and logical reasoning. But let's be honest about what happened: the 512-token limit was a dumb error on my part, and "final answer on last line" is just a better prompt that forces the chain-of-thought to complete cleanly before the model stops generating. The model was always capable. I was cutting it off mid-reasoning and then wondering why it couldn't answer. This is still a universal lesson for anyone evaluating local models, but the lesson is simpler than I first credited: give your model room to think, and tell it where to put the answer.
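To make the "final answer on last line" convention concrete, here is a small sketch of how a grader can rely on it. The instruction wording and the function names are assumptions for illustration; the original harness is not shown in this post.

```python
from typing import Optional

# Hypothetical wording; the post only says the prompt asked for the
# final answer on the last line.
ANSWER_INSTRUCTION = "Put your final answer alone on the last line."


def build_prompt(question: str) -> str:
    """Append the answer-placement instruction to the task."""
    return f"{question}\n\n{ANSWER_INSTRUCTION}"


def final_answer(completion: str) -> Optional[str]:
    """Return the last non-empty line of the completion, if any."""
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    return lines[-1] if lines else None


def is_correct(completion: str, expected: str) -> bool:
    """Exact match on the last line, ignoring a trailing period."""
    answer = final_answer(completion)
    return answer is not None and answer.rstrip(".") == expected
```

A completion that was cut off mid-reasoning ends on a partial working line, so it fails this check even if the model was on its way to the right answer. That is the failure a 512-token limit produces.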
Maintenance without AI analysis is just a report. Maintenance with AI analysis is an improvement loop. At first, the maintenance tasks were plain shell scripts that sent me daily emails with OK/FAIL status. Useful, but passive. I was getting a report, not a fix. Later, I changed the flow: when a warn or fail was detected, the system would spawn an OpenCode session with the report context and ask the agent to diagnose and fix it. The difference was night and day. Instead of "UFW is not active — you should fix it," I got "UFW was disabled by the reset; re-enabling now, verifying SSH still works, updating notes." The shift from reporting to acting is the single most important architectural decision for any autonomous system.
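The reporting-to-acting switch amounts to a small gate in front of the agent. The sketch below assumes a report format of one check per line starting with OK, WARN, or FAIL; that format and the prompt wording are hypothetical, and starting the OpenCode session itself is left out.

```python
from typing import List, Optional

ESCALATE = ("WARN", "FAIL")


def failing_checks(report: str) -> List[str]:
    """Return the report lines whose status is WARN or FAIL."""
    return [
        ln.strip()
        for ln in report.splitlines()
        if ln.strip().split(" ", 1)[0] in ESCALATE
    ]


def agent_prompt(report: str) -> Optional[str]:
    """Build a diagnose-and-fix prompt, or None when everything is OK."""
    failures = failing_checks(report)
    if not failures:
        return None  # nothing to act on: just send the plain report
    bullet_list = "\n".join(f"- {f}" for f in failures)
    return (
        "The daily health report flagged these checks:\n"
        f"{bullet_list}\n\n"
        "Diagnose each one, apply a fix, verify SSH still works, "
        "and update the notes.\n\nFull report:\n" + report
    )
```

Returning None on a clean report keeps the agent out of the loop on good days, so model calls are spent only when there is something to repair.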
The zettelkasten memory system held up. 43 atomic notes with YAML frontmatter, cross-references, and tags. The agent could find what it needed. The system is self-documenting and searchable.
What broke
1. SSH got locked out (minute 10)
A ufw --force reset dropped my active SSH session. The server was unreachable. This is the kind of thing that should never happen remotely. The fix was an emergency iptables rule and a cron script that restores SSH access every 60 seconds. The lesson: never reset a firewall on a live system. Always disable → modify → enable → verify.
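The disable, modify, enable, verify sequence can be written so that the verify step cannot be skipped. This is a sketch under assumptions: the command runner and the verify callback are injected (hypothetical names), and falling back to a disabled firewall on failure is one possible policy, chosen here because being locked out was the worse outcome in this experiment.

```python
import socket
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], None]


def ssh_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """True if a TCP connection to the SSH port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def change_firewall(rules: List[Sequence[str]], run: Runner,
                    verify: Callable[[], bool]) -> bool:
    """Apply ufw rules as disable -> modify -> enable -> verify."""
    run(["ufw", "disable"])
    for rule in rules:               # e.g. ["allow", "22/tcp"]
        run(["ufw", *rule])
    run(["ufw", "--force", "enable"])
    if verify():
        return True
    run(["ufw", "disable"])          # fail open rather than lose the box
    return False
```

One caveat: a check made from the server against its own loopback address may not pass through the same firewall rules as a remote client, so verify is more trustworthy when it probes the address you actually connect to, or runs from a second host.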
2. Tailscale route hijack
Enabling --accept-routes redirected my LAN subnet through the Tailscale tunnel, which broke the NAT return path for SSH. Fixed with an explicit /32 route. The lesson: Tailscale's routing table injection is not harmless.
3. Model evaluation produced garbage for 8 hours (my fault)
All evaluation runs scored 20-40% because max_tokens=512 truncated the model's chain-of-thought before it reached the answer. Eight evaluation runs produced unreliable data. The fix was max_tokens=8192. This was a dumb error: I set an arbitrary token limit without measuring actual output length. The model was verbose — 800-1200 tokens for simple math — and I was cutting it off at 512. The lesson is not "prompt engineering is magic." The lesson is: measure your outputs before you set limits.
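"Measure your outputs before you set limits" fits in a few lines. The sketch below is an assumption-laden illustration: the 2x headroom factor and the 1024 floor are arbitrary defaults, and the "length" finish reason is the convention used by OpenAI-compatible APIs, which the local model server may or may not follow.

```python
import math
from typing import Sequence


def suggest_max_tokens(observed: Sequence[int], headroom: float = 2.0,
                       floor: int = 1024) -> int:
    """Derive a token limit from measured completion lengths."""
    if not observed:
        raise ValueError("measure at least one completion first")
    budget = math.ceil(max(observed) * headroom)
    return max(floor, budget)


def truncation_rate(finish_reasons: Sequence[str]) -> float:
    """Fraction of completions that stopped because they hit the limit."""
    if not finish_reasons:
        return 0.0
    cut = sum(1 for reason in finish_reasons if reason == "length")
    return cut / len(finish_reasons)
```

With completions of 800 to 1200 tokens, suggest_max_tokens returns 2400, well above the 512 that caused the problem. Tracking the truncation rate per run would have flagged the issue after the first evaluation instead of the eighth.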
4. The self-evaluation regex bug
Three subjective task dimensions scored exactly 20% because the scoring code used re.search with the pattern r'[1-5]', which matched the first digit in the model's reasoning, and that reasoning always started with "1.". Every subjective task therefore scored 1 out of 5, which is where the 20% comes from. The fix is documented but was never tested before the experiment ended. The lesson: simple regex can mask complex failures. Always validate scoring logic manually.
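The failure is easy to reproduce. In the sketch below, buggy_score mirrors the behavior described above, while anchored_score is one possible fix and not necessarily the documented one: it assumes the judge is told to end with a line of the form "Score: N" (optionally "N/5"), which is a hypothetical output format.

```python
import re
from typing import Optional

# Only accept a score on a line of its own, e.g. "Score: 4" or "Score: 4/5".
_SCORE_LINE = re.compile(
    r"^\s*score:\s*([1-5])\s*(?:/\s*5)?\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def buggy_score(text: str) -> Optional[int]:
    """First digit 1-5 anywhere in the text (the behavior that went wrong)."""
    match = re.search(r"[1-5]", text)
    return int(match.group()) if match else None


def anchored_score(text: str) -> Optional[int]:
    """Last explicit score line, or None when the judge gave no score."""
    matches = _SCORE_LINE.findall(text)
    return int(matches[-1]) if matches else None


judge_output = (
    "1. The summary covers the main points.\n"
    "2. Tone is consistent.\n"
    "Score: 4/5\n"
)
print(buggy_score(judge_output), anchored_score(judge_output))
```

Returning None when no score line is present matters as much as the anchoring: a missing score then shows up as missing data rather than as a confident 1 out of 5.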
5. Export truncation at 64 KB
OpenCode's export command produces JSON that gets truncated at ~65536 bytes, causing parsing failures. Retry logic mitigated it but didn't fix the root cause. This is likely an upstream issue.
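The retry mitigation can be sketched generically. The fetch callable stands in for whatever invokes the export and returns its output as text; it is a hypothetical placeholder, and nothing here reflects OpenCode's actual interface.

```python
import json
import time
from typing import Any, Callable


def load_export(fetch: Callable[[], str], attempts: int = 3,
                delay: float = 1.0) -> Any:
    """Fetch an export and parse it, retrying when the JSON is cut short."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    last_size = 0
    for attempt in range(attempts):
        raw = fetch()
        last_size = len(raw)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc           # truncated payloads fail to parse
            if attempt < attempts - 1:
                time.sleep(delay)
    raise RuntimeError(
        f"export still unparseable after {attempts} attempts "
        f"(last payload: {last_size} characters)"
    ) from last_error
```

Reporting the payload size in the error is deliberate: a size that keeps landing near 65536 points at a buffer limit somewhere in the pipeline rather than at a flaky run.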
6. Concurrent cron runs corrupted data
Two evaluation runs produced garbled output because overlapping cron invocations wrote to the same files simultaneously. A lockfile fixed it. The lesson: any cron job that writes files needs concurrency protection, even if runs are expected to be brief.
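A lockfile guard for a cron job can be as small as the sketch below. It uses an advisory flock on Linux; the lock path is a hypothetical example, and this is an illustration of the technique rather than the exact lock G.E.A.R. used.

```python
import fcntl
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def single_instance(lock_path: str) -> Iterator[bool]:
    """Yield True if this process holds the lock, False if another run does."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Non-blocking exclusive lock: fail fast instead of queueing up
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        os.close(fd)  # closing the descriptor releases the lock


if __name__ == "__main__":
    with single_instance("/tmp/eval-run.lock") as have_lock:  # hypothetical path
        if have_lock:
            print("running evaluation")
        else:
            print("previous run still active, skipping")
```

Because the kernel drops the lock when the process exits, a crashed run cannot leave a stale lock behind, which is the usual weakness of "create a file, delete it at the end" schemes.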
The model evaluation data — and why it needs reworking
The agent ran 15 evaluation sessions across ~18 hours, testing Qwen 3.5 9B on 8 orthogonal task dimensions. Here are the verified scores after the prompt fix:
- Math & Reasoning: 100%
- Logical Reasoning: 100%
- Instruction Following: 93%
- Knowledge & QA: 87%
- Code Generation: 73% (Python 100%, JavaScript 100%, Bash 0%)
- Creative Writing: unknown (masked by regex bug)
- Summarization: unknown (masked by regex bug)
- Translation: unknown (masked by regex bug)
Three dimensions remain unverified. That is a finding too: if you cannot measure it reliably, you cannot trust it.
But the bigger problem with the eval loop is that the objectives were too broad to be meaningful. "Math & Reasoning" covers everything from basic arithmetic to combinatorics. "Code Generation" lumps Python, JavaScript, and Bash together, even though the model scored 100% on two and 0% on the third. When you aggregate across such wide dimensions, you get a number that looks precise but doesn't tell you what the model can actually do.
For v2, the eval loop needs narrow, specific objectives. Not "can it do math?" but "can it solve multi-step algebra problems with integer constraints?" Not "can it generate code?" but "can it write a Python function that parses CSV and returns a dictionary?"
Broad dimensions hide weaknesses. Narrow ones expose them.
What this means
The experiment answered three questions I had, but it also revealed one that matters more than the rest.
Can an AI agent operate autonomously through email? Yes. The interface is natural, the architecture is non-blocking, and thread continuation works. This pattern is reusable.
Can it maintain itself? Partially. The maintenance scripts worked. The self-test caught real issues. But the system still needed me to interpret failures and direct fixes. It was not fully autonomous — it was a very capable assistant that happened to run in the background.
Is this production-ready? No. The export truncation, the regex bug, the concurrent cron corruption — these are the kinds of things that matter when you're running something unattended for weeks, not hours. The architecture is sound. The implementation needs hardening.
The real learning: the self-health loop is the only loop that matters.
After 47 hours, the pattern that produced the most value was the simplest one: assume something is broken, run your self-health test, and fix it. Not "wait for a failure report." Not "hope the scripts run correctly." Actively assume failure, verify, and repair.
This is the loop every autonomous system needs to master before it can do anything else. If your agent can't run a health check, detect a deviation, and correct it without human intervention, nothing else matters. The email interface, the memory system, the model evaluation — all of that sits on top of a foundation that says: "I know when I'm broken, and I can fix myself."
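Stripped down, that loop is probe, repair, probe again. The sketch below is one way to express it; the Check structure and the three result labels are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Check:
    name: str
    probe: Callable[[], bool]    # True when the component is healthy
    repair: Callable[[], None]   # best-effort fix for this component


def health_loop(checks: List[Check]) -> Dict[str, str]:
    """Assume failure, verify, repair, and verify again."""
    results = {}
    for check in checks:
        if check.probe():
            results[check.name] = "ok"
            continue
        try:
            check.repair()
        except Exception:
            results[check.name] = "escalate"  # the repair itself broke
            continue
        # Never trust a repair: re-run the probe before reporting success
        results[check.name] = "repaired" if check.probe() else "escalate"
    return results
```

Only the "escalate" results need a human or a model session. Everything else is handled and logged, which is what separates this loop from a status report.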
In this experiment, that loop was bolted on gradually. In v2, it needs to be the first thing built.
The self-replication guide
One of the tasks I gave the agent was to write a comprehensive guide for reproducing the entire system on a new machine. It produced 381 lines covering base OS, packages, user config, OpenCode, mail system, Tailscale, firewall, SSH safety, maintenance scripts, SSL, and a verification checklist.
I have not tested it on a fresh machine yet. That is experiment v2.
The agentic loop problem
The agentic loop in coding agents is simple to explain. PI Dev's agentic loop is two nested loops: an outer loop that breaks tasks into subtasks, and an inner loop that thinks, calls a tool, observes the result, and repeats until the subtask is done. Clean. Understandable. It works because the task space is bounded — you know when code is written.
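The two nested loops look roughly like this. It is a generic sketch of the pattern as described, not PI Dev's code: plan and step are hypothetical callables standing in for the model calls, and the step budget is an added safeguard.

```python
from typing import Any, Callable, Dict, Iterable, Tuple

Plan = Callable[[str], Iterable[str]]
Step = Callable[[str, Any], Tuple[bool, Any]]


def run_task(task: str, plan: Plan, step: Step,
             max_steps: int = 20) -> Dict[str, Any]:
    """Outer loop over subtasks, inner think/act/observe loop per subtask."""
    results = {}
    for subtask in plan(task):                  # outer loop: decomposition
        observation = None
        for _ in range(max_steps):              # inner loop: act and observe
            done, observation = step(subtask, observation)
            if done:
                break
        else:
            raise RuntimeError(f"subtask {subtask!r} exceeded {max_steps} steps")
        results[subtask] = observation
    return results
```

The loop terminates cleanly because each subtask has a clear "done" signal. The levels that an autonomous agent adds (verification, repair, introspection) are harder precisely because that signal is less obvious.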
An autonomous agent needs more levels. You are not just completing tasks. You are managing goals that decompose into tasks, tasks that produce results that need verification, failures that trigger fixes, fixes that need validation, and introspection that leads to improvements. Each level is its own loop, and each loop can spawn the others.
Hermes from Nous Research handles this with a layered approach worth studying:
- Turn lifecycle — the base loop: receive input, call the model, execute tool calls, loop back. This is the heartbeat.
- Task state management — a todo tool that reads and writes the agent's own task list, so it can track what's done, what's pending, and what's blocked across turns.
- Task delegation — the ability to spawn subagents with isolated context and independent iteration budgets. The parent manages the goal, the children handle execution.
- Persistent memory — MEMORY.md and USER.md that survive session boundaries. The agent doesn't start from scratch each time.
- Context compression — when conversation exceeds 50% of the context window, middle turns are summarized. The agent keeps working without losing its thread.
- Fallback behavior — if the primary model fails, the agent switches to a fallback provider and continues. No human intervention.
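Of these layers, context compression is the easiest to illustrate in isolation. The sketch below shows the general idea only and is not Hermes code: the token counter and summarizer are injected placeholders, and the number of head and tail turns to keep is an assumption.

```python
from typing import Callable, List


def compress(turns: List[str], token_count: Callable[[str], int],
             window: int, summarize: Callable[[List[str]], str],
             keep_head: int = 1, keep_tail: int = 4,
             threshold: float = 0.5) -> List[str]:
    """Summarize middle turns once the conversation passes the threshold."""
    total = sum(token_count(turn) for turn in turns)
    if total <= window * threshold or len(turns) <= keep_head + keep_tail:
        return turns  # still fits comfortably, or nothing in the middle
    tail_start = len(turns) - keep_tail
    middle = turns[keep_head:tail_start]
    # Keep the opening instructions and the most recent turns verbatim
    return turns[:keep_head] + [summarize(middle)] + turns[tail_start:]
```

Keeping the first turn and the latest few intact preserves the original goal and the current working state, which are the two things an agent can least afford to lose.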
I did not implement this layered structure in G.E.A.R. I had flat cron jobs and flat email sessions. The agent could do things, but it couldn't manage the relationships between things — which task depends on which result, which failure blocks which goal, which improvement should run next.
For v2, the loop architecture needs to be the first design decision. Not the email interface. Not the maintenance scripts. The loop.
What I would do differently
- Set max_tokens higher from the start. The 8 hours of garbage data were avoidable.
- Build monitoring and token tracking from day one, not retroactively.
- Test the self-evaluation scoring logic with manual cases before trusting it.
- Add a lockfile to every cron job that writes files. Not after the first corruption. Before the first run.
- Define narrow eval objectives. "Math" is not a task. "Solve multi-step algebra with integer constraints" is.
- Build the self-health loop first. Everything else depends on it.
- Pipe maintenance reports through AI analysis from day one. A report is not a fix. An improvement loop is.
What's next
Experiment v2 will test the self-replication guide on a fresh machine. It will also fix the three remaining evaluation bugs and run a proper comparison across multiple local models.
The question is not whether AI agents can work. The question is which patterns work, for what workloads, and at what cost. The only way to answer that is to run them.
