Debugging OpenClaw from San Diego

At 6:10am in San Diego, my AI assistant basically told me: “lol, no.”

API rate limit reached. Please try again later.

Yesterday we flew from Austin to San Diego for a family trip — me, my wife, and our one-year-old. He’s in the phase where he loves animals, knows their names, and does the sounds.

So yes, we’re here for the zoo.

Also yes, I still tried to do remote automation tinkering on vacation. (This is a character flaw I’m actively working on.)

The plan: “I’ll just add a fallback model”

I run OpenClaw (my agent is named Grover), and I use a Codex subscription. I’d been running GPT 5.3 Codex on the $20/month plan.

Then the rate-limit message hit. I’d tripped usage limits, and they weren’t resetting while I was gone.

My first thought wasn’t panic. It was: ah man.

My second thought was optimism: this is exactly the kind of thing agents are supposed to help with. We’ll fix it from Telegram. It’ll be kind of awesome.

So I did what any rational adult does at 6am on vacation: I added a new fallback provider.

OpenClaw’s docs recommend Venice.ai, so I signed up, created an API key, and told Grover to wire it in.

That’s when it got weird.

How it went sideways

The fallback provider got added. The gateway restarted. And Grover immediately felt different — slower, less sharp, more “corporate compliance email” and less “competent operator.”

I looked closer and realized I hadn’t landed on the model I intended. I wanted Kimi K2.5 as backup. I got Llama 3.3 70B.

For this job, it was the wrong fit.

Not because it’s a bad model in general. Because I needed something that could follow a messy operational thread, keep state, and do careful config surgery.

Llama responded like a polite intern: “I can do that if you want.”

Not: “I already did it.” Not: “Here’s the exact diff and how to validate it.”

And once I realized the assistant had gotten worse while still writing to config, I started to panic.

The real problem: my background jobs were still running

Before the trip, I set up a coding-agent “exploration mission” — basically a small fleet of scheduled jobs running while I was gone: an hourly global task worker, a morning report, and a couple audit passes during the day.

Key detail: those jobs run in isolated sessions.

So they don’t necessarily use the model I’m chatting with. They use defaults.

Which meant even after I upgraded my Codex subscription (and got chat back onto GPT 5.3 Codex), my cron jobs were still living in Llama land.

The hourly task log started showing symptoms. Not a clean failure. Not a loud crash. More like the agent was overwhelmed by the instructions and couldn’t hold the whole task shape.

It was trying to help. It just wasn’t executing.

That’s the worst kind of broken.

Beach intermission

We went to the beach. Our son saw the ocean for the first time. He loved it.

Then we had to leave, and he cried — real “this is injustice” crying.

I remember thinking: this is the point of the trip. Not the robots.

When we got back, he went down for a nap.

And I went back to work.

A mistake I made before the trip

I assumed I could SSH into my Mac mini if I needed to. Tailscale was set up, so I figured I had a remote escape hatch.

I never tested it.

And of course, macOS Remote Login wasn’t enabled. So when I needed SSH, it was a dead end.

That’s on me.
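For anyone in the same spot: the pre-trip check I skipped takes about a minute. On macOS, "Remote Login" is the built-in SSH server, and you can enable and verify it from a terminal. (The hostname and username below are placeholders; swap in your own tailnet names.)

```shell
# Enable the macOS SSH server (Remote Login). Requires sudo.
sudo systemsetup -setremotelogin on

# Confirm it's actually on.
sudo systemsetup -getremotelogin

# Check that Tailscale is up and the machine shows in your tailnet.
tailscale status

# The test that actually matters: SSH in from another device BEFORE you leave.
ssh yourname@mac-mini.your-tailnet.ts.net 'echo ok'
```

The last line is the whole lesson: an escape hatch you haven't exercised end-to-end isn't an escape hatch.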

The actual failure mode

Here’s what took me a minute to understand: two things can be true at once.

  • The system can still be doing real work.
  • The admin/control layer can be broken.

That’s what happened. Cron was still firing, but I couldn’t manage anything.

The gateway control layer was wedged on an auth error:

device token mismatch

Not “the gateway is down.” Not “you don’t have a token.”

More like: the token I had and the token the gateway expected were no longer the same identity.

And because I was on vacation, I couldn’t just hop on the box and fix it the normal way.

So it was me and Grover, in Telegram, doing remote incident response while a toddler slept.

The fix we used (and why it worked)

This is the part I want to remember.

We treated it like a break-glass moment.

We temporarily switched the gateway to shared-password access so we could regain control, then rotated and reissued the device token so token auth worked again. After that, we switched security back to token auth and removed the temporary password.
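In rough shell terms, the sequence looked like this. These commands are hypothetical stand-ins, not OpenClaw's real CLI; what matters is the order of operations.

```shell
# 1. Break glass: temporarily switch the gateway to shared-password auth
#    so we can regain admin access at all. (Hypothetical command.)
openclaw gateway auth --mode password --password "$TEMP_PASSWORD"

# 2. With control restored, rotate and reissue the device token so the
#    token we hold and the token the gateway expects match again.
openclaw gateway token rotate --device grover

# 3. Switch back to token auth and remove the temporary password.
openclaw gateway auth --mode token
```

The design point: the temporary password exists only for the duration of the incident, and the last step is always deleting it.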

Once control was back, we fixed what started the whole mess:

  • Set the default model for agent runs (including cron and subagents) to GPT 5.3 Codex
  • Set fallback to Kimi K2.5
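Sketched as config changes, with the same caveat that the commands and key names are hypothetical, the fix amounted to setting defaults that apply to every session type, not just interactive chat:

```shell
# Default model for ALL agent runs — chat, cron jobs, and subagents.
# Isolated sessions don't inherit the chat session's model; they use these.
openclaw config set agents.defaults.model "gpt-5.3-codex"

# Fallback for when the primary hits rate limits.
openclaw config set agents.defaults.fallback "kimi-k2.5"
```

That second point was the original trap: fixing the model I was chatting with did nothing for the scheduled jobs, because they read the defaults.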

Everything started behaving again.

What I’m taking with me

Remote ops from Telegram is possible.

That still feels a little absurd.

Next time, I’m testing SSH before I leave Austin.

Ok. I’m going to the zoo.

(Oh, and if you were wondering: yes, Grover did write this blog for me. I’m on vacation bro 😎)