Built in a day with Claude Code and Cursor — durable workflows, LLMs that actually use your data, and the boring middle of indie engineering.
The problem
Most training plans I've followed look the same. Eight to sixteen weeks of templated PDFs, easy / threshold / long, scaled to a goal pace I picked when I was probably fitter than I currently am, then frozen the moment I hit "download". I'd skip a session because of a bad sleep, or smash a session because I felt great, and the plan would just keep marching on as if nothing had happened.
The other problem was simpler. It was Thursday. I had time for a run. I didn't know what I should do.
Strava is brilliant at telling you what you ran. It's not built to tell you what you should run. The closest you get is "compare yourself to others on this segment", which has helped exactly zero of my races.
A few weeks ago I was on a slow run with my friend Sean chatting about exactly this. He's a data person, I'm a builder, we both run reasonably seriously. Halfway through the run I realised the frustration we were articulating was a software problem. Not a hard one. A fun one.
A real coach knows your real history. They watch your runs over weeks. They notice when you've been sleeping badly, when work has been chaos, when you suddenly have a string of strong sessions. They adjust accordingly. They also tell you what to do on Thursday.
I wanted to see how close software could get to that.
What I was actually testing
Two things in parallel.
A runner question. Could you build a coaching tool that works from real Strava and Whoop data, adapts as your training drifts, and still feels like talking to a person?
An engineering question. I'd been bookmarking Hacker News stories about Inngest, Neon, Statsig, and the Anthropic SDK for months. Could I actually wire up a non-trivial product end-to-end with that stack and discover where the rough edges are?
Both questions are the kind that don't get answered by reading. You have to ship the thing.
What I built
Stride is a Next.js app at runstri.de that connects to your Strava (mandatory), Whoop (optional), and Google Calendar (optional). It pulls your last twelve months of running on first connect, then stays in sync via webhooks. It has four interesting surfaces:
- Conversational plan creation. Tell the app what you're training for in plain English. It asks the right follow-up questions ("how many days a week can you run?", "what's your recent long run?"), then generates a structured training plan that maps to your real fitness today. You edit the plan back through chat at any time.
- Activity feed that talks back. Every Strava run shows up alongside a one-paragraph coaching insight, generated from the actual session and the surrounding context (recent training load, recovery, plan position).
- Adaptive replanning. If your last two weeks suggest you should pull back (or that you're ready to push), the app surfaces a suggestion before the gap widens. You confirm; it rebuilds the rest of the block.
- Workout Surprise. A deck of single-session ideas for "I have time today, what should I do?". Sixteen hand-curated quality sessions you swipe through, send to your calendar, or share with a training partner via SMS.
There's also a chat interface where you can ask anything and get answers grounded in your real data, and an insights page with charts most apps don't show: predicted race times, recovery curves, fitness trend, year-long consistency map.
The engineering bits that taught me something
Durable workflows changed how I think about background work.
The first hard problem in this app is plan generation. The user types a goal in chat. We need to call an LLM with a serious prompt, parse a structured plan, expand it into individual workouts, write them all to Postgres, push them to Google Calendar, and send a notification email. That's at least eight steps that need to happen reliably and in order, with retries on each, while the user waits and wants progress feedback.
The traditional way is to enqueue a job, poll a worker, persist progress flags somewhere, and pray your retries don't double-write. I've done it. It's fine. It's also a lot of glue.
Inngest gave me a function-as-workflow primitive. Every step is a step.run() with its own retry and persistence semantics. If step five fails, step six doesn't even get called, and the next retry resumes from step five's last attempt. Webhooks dispatch events, which fan out into independent workflows. Step results are durably stored, so I can replay a failed plan generation an hour later and get the same answer.
The mental shift was bigger than the code. I stopped thinking about queues. I started thinking about events and steps. Most of my background work in this app is now declarative.
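To make that concrete, here's a minimal sketch of the shape, assuming a standard Inngest client setup. The event name and the helper functions (callClaudeForPlan, expandPlanToWorkouts and friends) are illustrative stand-ins, not the app's real code.

```ts
import { inngest } from "./inngest/client"; // standard Inngest client setup

export const generatePlan = inngest.createFunction(
  { id: "generate-plan" },
  { event: "plan/requested" },
  async ({ event, step }) => {
    // Each step.run() is retried independently and its result is persisted,
    // so a failure at step three never re-executes steps one and two.
    const plan = await step.run("call-llm", () =>
      callClaudeForPlan(event.data.userId, event.data.goal),
    );
    const workouts = await step.run("expand-workouts", () =>
      expandPlanToWorkouts(plan),
    );
    await step.run("write-postgres", () => saveWorkouts(workouts));
    await step.run("push-calendar", () => pushToGoogleCalendar(workouts));
    await step.run("send-email", () => sendPlanReadyEmail(event.data.userId));
  },
);
```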
LLMs are good when you give them real data and a small job.
The other interesting unknown was how to make Claude actually feel like a coach instead of a generic chatbot. The pattern that worked, after a few rewrites, was:
- Pre-fetch the user's last twelve months of runs, recent recovery, current plan position, and goals.
- Format that into a tight context block (a few hundred tokens, structured).
- Add a system prompt that defines the coach's role and tone.
- Pass the user's question as the only free-form variable.
The result is answers like "your easy pace at 150bpm has dropped 9 seconds per kilometre over the last six weeks, you're getting fitter" instead of "training consistency is important". Real numbers, real context, real opinion.
The lesson, which I should have known from reading: LLMs aren't creative writing machines. They're tireless interns with access to a notepad. Give them a clean notepad and a small, clear job and they're brilliant. Give them no context and a vague question and you get LinkedIn-ese.
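As a rough sketch of what that call ends up looking like (the model name, prompt wording, and helper functions here are illustrative, not Stride's real code):

```ts
import Anthropic from "@anthropic-ai/sdk";

// buildContextBlock() and getUserApiKey() are hypothetical helpers standing in for
// the pre-fetch step: last twelve months of runs, recovery, plan position, goals.
export async function askCoach(userId: string, question: string) {
  const context = await buildContextBlock(userId); // a few hundred structured tokens
  const anthropic = new Anthropic({ apiKey: await getUserApiKey(userId) });

  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system:
      "You are a running coach. Be specific, cite the runner's own numbers, and give one clear opinion.",
    messages: [{ role: "user", content: `${context}\n\nQuestion: ${question}` }],
  });

  return response.content;
}
```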
Adaptive replanning is just drift detection plus an LLM with a typed schema.
When I started I assumed adaptive plans needed some kind of reinforcement learning model. I was wrong.
What actually works:
- A pure-function comparison between what the plan said and what actually happened over the last two weeks. Skipped sessions, distance drift, pace drift, recovery slump if Whoop is connected.
- If a meaningful drift is detected, that signal becomes a structured input to an LLM call.
- The LLM produces a typed JSON proposal with a small set of allowed change types: reduce volume on these sessions, swap this type, remove this rep block, push this date out by N days.
- The user sees a human-readable summary, accepts or rejects, and the change applies.
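A minimal sketch of that first step, the pure-function drift comparison. The types and thresholds here are illustrative, not the app's real values.

```ts
// Compare what the plan said against what actually happened over a window.
interface PlannedSession { date: string; km: number }
interface CompletedRun { date: string; km: number }

interface DriftSignal {
  skippedSessions: number;
  distanceDriftPct: number; // actual weekly km vs. planned weekly km
}

function detectDrift(planned: PlannedSession[], actual: CompletedRun[]): DriftSignal | null {
  const doneDates = new Set(actual.map((r) => r.date));
  const skippedSessions = planned.filter((p) => !doneDates.has(p.date)).length;

  const plannedKm = planned.reduce((sum, p) => sum + p.km, 0);
  const actualKm = actual.reduce((sum, r) => sum + r.km, 0);
  const distanceDriftPct = plannedKm === 0 ? 0 : ((actualKm - plannedKm) / plannedKm) * 100;

  // Only a meaningful drift becomes input to the LLM call.
  const meaningful = skippedSessions >= 2 || Math.abs(distanceDriftPct) > 20;
  return meaningful ? { skippedSessions, distanceDriftPct } : null;
}
```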
No model training. No fine-tuning. Just clean inputs and a constrained output schema. The LLM is a dispatch layer that converts "the human-readable observation" into "structured changes the database knows how to apply".
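In Zod, a constrained proposal schema looks roughly like this. The specific change types and field names are illustrative; the point is that the set of allowed options is closed.

```ts
import { z } from "zod";

// Each change type is a narrow, named operation the database knows how to apply.
const planChangeSchema = z.discriminatedUnion("type", [
  z.object({
    type: z.literal("reduce_volume"),
    workoutIds: z.array(z.string()),
    reductionPct: z.number().min(5).max(50),
  }),
  z.object({
    type: z.literal("swap_session_type"),
    workoutId: z.string(),
    newType: z.enum(["easy", "threshold", "long", "rest"]),
  }),
  z.object({
    type: z.literal("push_date"),
    workoutId: z.string(),
    days: z.number().int().min(1).max(14),
  }),
]);

const proposalSchema = z.object({
  summary: z.string(), // the human-readable explanation the user accepts or rejects
  changes: z.array(planChangeSchema).min(1),
});

// Anything the LLM returns outside this shape is rejected before it touches the DB.
type Proposal = z.infer<typeof proposalSchema>;
```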
I think this pattern (deterministic detection plus LLM-as-translator with a typed schema) is going to keep showing up. It's the boring middle of LLM application engineering and I want to write more about it.
Schema-first everything.
I used Drizzle ORM with a single source-of-truth schema file. Generated migrations rather than hand-written SQL. Typed end-to-end via tRPC. Validated at the edge with Zod. Validated environment variables at boot via @t3-oss/env-nextjs, so anything missing fails the build instead of a runtime stack trace at 11pm.
None of this is novel. All of it removed entire categories of bugs I've shipped before. The four hours I saved by not chasing "undefined is not a function" errors more than paid for the half-day I spent setting up the plumbing.
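For reference, the env validation piece is roughly the following; the variable names are made up for illustration.

```ts
// env.ts: validated at build/boot, so a missing variable fails loudly instead of at 11pm.
import { createEnv } from "@t3-oss/env-nextjs";
import { z } from "zod";

export const env = createEnv({
  server: {
    DATABASE_URL: z.string().url(),
    API_KEY_ENCRYPTION_KEY: z.string().min(44), // base64-encoded 32-byte master key
  },
  client: {
    NEXT_PUBLIC_POSTHOG_KEY: z.string().min(1),
  },
  runtimeEnv: {
    DATABASE_URL: process.env.DATABASE_URL,
    API_KEY_ENCRYPTION_KEY: process.env.API_KEY_ENCRYPTION_KEY,
    NEXT_PUBLIC_POSTHOG_KEY: process.env.NEXT_PUBLIC_POSTHOG_KEY,
  },
});
```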
Three patterns worth a closer look
The previous section was the high level. For readers who came here for engineering substance, here are three patterns I think are worth pulling apart.
1. Bring-your-own-key for the LLM, with a per-user dollar cap
Every Stride user supplies their own Anthropic API key. The reasoning is partly economic (I don't want to be on the hook for runaway spend on a side project) and partly architectural (it keeps the LLM cost surface inside the user account, which is where it belongs in a tool that personalises everything around them).
The flow:
- User pastes their key into Settings.
- The server encrypts the plaintext with AES-256-GCM, using a 32-byte master key held in a Vercel environment variable and a freshly generated 12-byte IV per encryption. The stored format is {iv}:{ciphertext}:{authTag}, base64-encoded into a single Postgres column.
- On every AI request, the server fetches the encrypted blob, decrypts in memory, instantiates an Anthropic client scoped to that user's key, and routes the call. Decrypted keys never leave function scope. They aren't logged. They aren't sent to Sentry. They aren't returned to the client.
- Every call writes a row to a usage_events table with model, input tokens, output tokens, and computed USD cost. A per-user daily USD cap (default $5) is enforced before each call. Over the cap, the call returns a structured error the UI handles gracefully.
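A minimal sketch of the encrypt/decrypt pair, using Node's built-in crypto module; the environment variable name and function names are illustrative, not the app's real ones.

```ts
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// 32-byte master key held outside Postgres, in an environment variable.
const MASTER_KEY = Buffer.from(process.env.API_KEY_ENCRYPTION_KEY!, "base64");

export function encryptApiKey(plaintext: string): string {
  const iv = randomBytes(12); // fresh 12-byte IV per encryption
  const cipher = createCipheriv("aes-256-gcm", MASTER_KEY, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const authTag = cipher.getAuthTag();
  // {iv}:{ciphertext}:{authTag}, each base64, stored as a single column
  return [iv, ciphertext, authTag].map((b) => b.toString("base64")).join(":");
}

export function decryptApiKey(stored: string): string {
  const [iv, ciphertext, authTag] = stored.split(":").map((s) => Buffer.from(s, "base64"));
  const decipher = createDecipheriv("aes-256-gcm", MASTER_KEY, iv);
  decipher.setAuthTag(authTag);
  // final() throws on a tampered blob instead of returning garbage.
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```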
The threat model isn't "someone phishes one user's key". It's "the database is dumped". With AES-256-GCM and the master key held outside Postgres, a DB dump alone doesn't compromise keys. The GCM auth tag is doing real work too: a tampered ciphertext fails to decrypt rather than silently producing garbage that gets sent to Anthropic.
There's one follow-up I haven't shipped yet. The current envelope binds every blob to a single master key. To rotate I'll need a versioned envelope ({key_version}:{iv}:{ct}:{tag}) and a rolling re-encrypt job. Worth doing before scale, not before users.
2. Strava webhooks, idempotent end to end
Strava's webhook contract is at-least-once. They retry on any non-2xx response for up to 24 hours. If I treated each delivery as "create new activity", a single transient timeout would silently double-write your training history.
The pattern that worked:
- The webhook handler validates the verify_token, parses the payload, and returns 200 OK immediately. The sync itself runs as an Inngest event (strava/activity.received).
- The Inngest function does an upsert keyed on (user_id, strava_activity_id). The composite unique index in Postgres serialises any genuinely concurrent deliveries: one wins the insert, the other becomes an update.
- Delete events soft-delete (deleted_at) rather than removing the row. Training context survives even if the user deletes a Strava activity by accident.
- The Inngest event itself is the durability layer. If sync fails halfway through (Strava API hiccup, Anthropic outage on the insight-generation step), the run resumes from the last successful step rather than re-processing the whole activity.
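A sketch of the upsert step with Drizzle; the table, column, and import names are illustrative stand-ins for the app's real schema.

```ts
import { db } from "./db"; // the app's Drizzle client
import { activities } from "./schema";

interface StravaActivityPayload {
  id: number;
  name: string;
  distance: number;
  start_date: string;
}

// Idempotent by construction: the same delivery can arrive twice, and the
// composite (user_id, strava_activity_id) key turns the second into an update.
export async function upsertActivity(userId: string, a: StravaActivityPayload) {
  await db
    .insert(activities)
    .values({
      userId,
      stravaActivityId: a.id,
      name: a.name,
      distanceM: a.distance,
      startedAt: new Date(a.start_date),
    })
    .onConflictDoUpdate({
      target: [activities.userId, activities.stravaActivityId],
      set: { name: a.name, distanceM: a.distance, startedAt: new Date(a.start_date) },
    });
}
```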
The bug I shipped in week one: the original schema had UNIQUE(strava_activity_id) rather than UNIQUE(user_id, strava_activity_id). Two different users can legitimately share the same Strava activity ID, e.g. if they're tagged on the same group run. The first user to connect Strava would claim that ID; the second user's sync would silently fail for that record. The fix was a one-line migration. The lesson was bigger than the diff: in a multi-user system, uniqueness is almost always relative to a tenant. "This ID is unique because the upstream provider said so" is the wrong default frame.
3. Plan generation as templates plus deterministic expansion
The obvious way to use an LLM for plan generation is to ask it for the whole plan. "Generate a 12-week marathon plan for a runner with these characteristics." You get back JSON with 84 days of workouts. Half of them have wrong dates, three collide on the same day, and one is named "rest day with optional 30km long run".
What works instead is a two-stage pipeline:
- The LLM is given the user's profile, a snapshot of their last twelve weeks of training, the race goal, and a strict JSON schema. It returns a list of session templates, not dates. Each template looks roughly like { phase, week, dayOfWeek, type, intensity, structure }.
- A pure deterministic expander walks the user's calendar from today to race day, slots templates onto the matching dayOfWeek per week, applies progression rules (volume up ~10% per week, deload every fourth), and writes to the workouts table inside a single transaction.
The LLM never sees a date. It can't drift on calendar arithmetic, daylight saving, or the user's timezone. It can't produce collisions because it doesn't pick days; it picks types and slots.
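A sketch of the expansion step. The day-of-week convention and the types here are simplified for illustration; progression and deload rules are omitted.

```ts
// Templates come from the LLM with no dates attached; the expander owns the calendar.
interface SessionTemplate {
  phase: string;
  week: number;      // 1-indexed week of the plan
  dayOfWeek: number; // 0 = Monday ... 6 = Sunday (assumed convention)
  type: "easy" | "threshold" | "long" | "rest";
  intensity: string;
  structure: string;
}

interface Workout extends SessionTemplate {
  date: Date;
}

function expandTemplates(templates: SessionTemplate[], startMonday: Date): Workout[] {
  return templates.map((t) => {
    // Pure calendar arithmetic: week offset plus day offset from the plan's first Monday.
    const date = new Date(startMonday);
    date.setDate(date.getDate() + (t.week - 1) * 7 + t.dayOfWeek);
    return { ...t, date };
  });
}
```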
Validation sits between the two stages. If the LLM returns 11 templates instead of 12, or two long runs in the same week, the expander throws a typed error and the Inngest workflow retries the LLM step with the validation message appended to the prompt. The retry is bounded; if it still fails after three attempts, the workflow halts and the user sees a real error rather than a quietly broken plan.
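In sketch form, the bounded retry is a loop that feeds the validation failure back into the prompt. The helper and schema names here are illustrative; in the app this sits inside the Inngest step.

```ts
// templateListSchema is the strict Zod schema for the list of session templates;
// callLlmForTemplates() is a stand-in for the LLM call.
async function generateTemplatesWithRetry(basePrompt: string, maxAttempts = 3) {
  let prompt = basePrompt;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await callLlmForTemplates(prompt);
    const parsed = templateListSchema.safeParse(raw);
    if (parsed.success) return parsed.data;
    // Append the validation message so the next attempt can correct itself.
    prompt = `${basePrompt}\n\nYour previous answer failed validation:\n${parsed.error.message}`;
  }
  // Bounded: surface a real error instead of a quietly broken plan.
  throw new Error(`Plan templates failed validation after ${maxAttempts} attempts`);
}
```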
This is a generalisable pattern I'll use again. Use the LLM for the part where judgement matters: what to run, at what intensity, in what order. Don't use it for the part where determinism matters: what date, in what timezone, against which existing entries. The seam between LLM output and deterministic code should be a typed schema, not a free-form blob the rest of the system has to defensively parse.
What I'd do differently next time
A few honest reflections after building this:
- I started with the LLM features and worked backwards to the data layer. I should have done it the other way around. The data model is the foundation of everything; the LLM is just a renderer of the model. Working forwards from the schema would have saved me at least one Drizzle rebaseline.
- Conversational interfaces are harder than they look, even when the LLM is doing the heavy lifting. Designing the back-and-forth flow so the user feels guided (rather than interrogated) took more iterations than the actual prompt did. The number of "today is May 4, not May 3" date-handling bugs was non-zero.
- Free-tier observability is genuinely good now. Sentry, PostHog, Statsig, Vercel Analytics, Vercel Speed Insights, and Inngest all have free tiers that, combined, give you better observability than you'd have got at most companies five years ago. There's no excuse to ship blind.
- Rate-limiting AI endpoints is not optional. I added per-user sliding-window limits via Upstash after I imagined what could happen if a single user accidentally pasted my chat URL into a script. The Anthropic daily dollar cap is the ultimate safety net, but rate limits are the cheap one.
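For the curious, the sliding-window limit is a few lines with the Upstash SDK. The window and quota here are illustrative, not the app's real numbers.

```ts
import { Ratelimit } from "@upstash/ratelimit";
import { Redis } from "@upstash/redis";

// Per-user sliding-window limit on the AI endpoints.
const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(20, "1 m"), // 20 requests per user per minute
});

export async function assertWithinLimit(userId: string) {
  const { success } = await ratelimit.limit(`ai:${userId}`);
  if (!success) {
    throw new Error("Rate limit exceeded, try again in a minute");
  }
}
```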
What's next
The app is live at runstri.de. It works for me. I'm running my next half marathon plan off it. A handful of friends from my run crew are also using it as the early test cohort.
The next stretch of work is honest user feedback. The plan generator is OK; it could be great. The activity feed is functional; it could be a delight. The chat experience is the most magical bit but also the most variable.
If you've read this far and you're a runner, I'd love you to try it. Connect Strava, tell it your goal, see what happens.
If you've read this far and you're a builder, I'm always up for a chat about the boring middle of LLM engineering. Or the joy of durable workflows. Or what to run on a Thursday.