---
name: triage
description: Production incident response. Forces a 90-second decision step before diving into code — severity, blast radius, reversibility, time pressure — then commits to a response category (rollback, hotfix, patch, monitor, ignore). Built for EAA prod incidents (IG pipeline, public site, admin tools).
trigger: /triage
---

# /triage

When something breaks in production, the instinct is to dive into code. Half the time that's right; half the time you spend an hour fixing something a 30-second rollback would have handled. /triage forces a decision step BEFORE the code dive: severity, blast radius, reversibility, time pressure, then choose the response.

This is the EAA-shaped skill. You have live users on the IG pipeline and the public site. When something breaks, the wrong move is to start typing.

## Usage

`/triage <one-line description of what's wrong in prod>`

Examples:
- `/triage Publish queue stuck for 2 hours`
- `/triage Public /skills page returning 500`
- `/triage Admin can't log in`

## What it's for

Production incidents have two failure modes for the responder:
1. **Panic-fix** — start coding before understanding the scope. Often makes it worse.
2. **Over-plan** — write a 5-page incident plan before doing anything. Site stays down.

/triage threads the middle: 90 seconds of structured assessment, then commit to a response category. After that, switch to fix mode.

## What You Must Do When Invoked

### Step 1 — Severity

Pick one. No "between 🟡 and 🟢."

- **🔴 Down** — feature/site fully unusable for affected users.
- **🟡 Degraded** — works but slower, partial, or wrong in a noticeable way.
- **🟢 Cosmetic** — wrong-looking but functionally fine.

### Step 2 — Blast radius

Who's affected?
- **Public** — site visitors, prospects, paying customers
- **Admin** — Jake or admin users only
- **Pipeline** — automated systems (cron, queue) — invisible to humans short-term but compounds
- **One user** — single account or single record

### Step 3 — Reversibility

- **Rollback exists?** Last good deploy known? DB state safe to roll back to (no destructive migration since)?
- **Data damage?** Has the bug written bad data that survives a rollback?
- **External side effects?** Sent emails, posted to Instagram, charged cards — anything that can't be undone?

### Step 4 — Time pressure

- **Now** — minutes matter (public site down, paying users blocked)
- **Today** — hours OK (admin tool broken, internal pipeline)
- **This week** — days OK (cosmetic, edge case affecting few users)

### Step 5 — Choose the response

Map the assessment to one category:

- **Rollback** — revert the deploy. Right when (a) recent deploy is the obvious cause, (b) rollback is safe (no migration data damage), (c) severity high enough to justify undoing other work in the deploy.
- **Hotfix** — narrow surgical fix, ship now. Right when severity demands speed and rollback would lose unrelated work.
- **Patch** — non-urgent fix in next normal deploy. Right when severity is low or fix is risky enough to want review.
- **Monitor** — don't fix yet, watch the metrics. Right when you're not sure it's actually broken or it might self-resolve.
- **Ignore (for now)** — known issue, accept the cost, document. Right for true cosmetic / edge-case bugs that aren't worth the engineering time today.
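The mapping above can be sketched as a small decision function. This is a rough heuristic only — the parameter names and the ordering of checks are illustrative assumptions, and real triage also weighs blast radius and external side effects:

```python
def choose_response(severity: str, rollback_safe: bool,
                    recent_deploy_is_cause: bool, time_pressure: str) -> str:
    """Illustrative sketch of the Step 5 mapping; not the skill's literal logic."""
    if severity == "down" and recent_deploy_is_cause and rollback_safe:
        return "rollback"   # fast, safe undo beats diagnosing under pressure
    if severity == "down" or time_pressure == "now":
        return "hotfix"     # speed matters but rollback is unsafe or unavailable
    if severity == "degraded":
        return "patch"      # fix in the next normal deploy
    if severity == "cosmetic":
        return "ignore"     # document, accept the cost for now
    return "monitor"        # not clearly broken; watch before acting
```

The check order encodes the priority: a safe rollback always wins when the deploy is the cause, and "monitor" is the fallthrough for anything not clearly broken.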

### Step 6 — Output the triage card

```markdown
# /triage — <one-line>

**Severity:** 🔴 / 🟡 / 🟢
**Blast radius:** <public | admin | pipeline | one user>
**Reversibility:** <rollback safe | data risk | side effects irreversible>
**Time pressure:** <now | today | this week>

## Response: <rollback | hotfix | patch | monitor | ignore>

**Why:** <1–2 sentences>

## Next action
<Single concrete instruction. "Run: railway rollback <deploy-id>" or "Patch in <file:line>, ship in next deploy" or "Watch /admin/queue for 30 min before acting.">

## After the fix
<1 line: post-mortem? notify users? Usually "/post-mortem <topic>" once stable.>
```

### Step 7 — Stop

Don't dive into code yet. Output the triage card. Jake confirms the response category, THEN switch to fix mode.

If the user wants to skip triage and just fix, fine — but only after they've stated the response category aloud. The decision is the deliverable.

## Calibration for EAA

- **Public site down → rollback first, diagnose second.** Railway has fast rollback. Use it.
- **IG pipeline stuck → almost always patch, rarely rollback.** Pipeline state is sensitive; rolling back may re-introduce data we just deleted. Force-drain + patch is usually right.
- **Admin tool broken → hotfix or patch depending on whether Jake needs it today.**
- **Migrations gone wrong → STOP. This is its own protocol — don't run a `down` migration blindly.** Triage stops at "needs eyes," doesn't pick rollback.
- **Stuck PENDING blogs (the recurring one) → almost always patch.** Force-drain handles the immediate stuck row, post-mortem captures the pattern.
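The calibration defaults above can be summarized as a lookup table — a sketch with paraphrased keys, not literal strings the skill matches on:

```python
# Sketch of the EAA calibration defaults (paraphrased symptoms, illustrative only).
EAA_DEFAULTS = {
    "public site down":     "rollback first, diagnose second",
    "ig pipeline stuck":    "patch (force-drain + fix; rollback risks re-adding deleted data)",
    "admin tool broken":    "hotfix if Jake needs it today, else patch",
    "migration gone wrong": "stop: own protocol, never a blind down migration",
    "stuck pending blogs":  "patch, then post-mortem the recurring pattern",
}
```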

## What to avoid

- **No "let me investigate" without a triage card.** Investigation is a response, but it should be chosen explicitly, not defaulted into.
- **No skipping severity.** Even an obvious 🟢 needs naming so the response matches.
- **No coding inside this skill.** Triage outputs a card. Fix happens after, with the response category committed.
- **No multi-step responses in one card.** "Hotfix and also rollback the partial deploy" — pick one or sequence them under "Next action."
- **No rolling back a deploy that included a migration without confirming the migration is reversible first.**

## Example output (abbreviated)

```markdown
# /triage — Public /skills page returning 500

**Severity:** 🔴 Down
**Blast radius:** Public — every visitor to /skills sees an error page
**Reversibility:** Rollback safe — no migration in last deploy
**Time pressure:** Now — costs prospects, looks unprofessional

## Response: Rollback

**Why:** Last deploy was 20 min ago, /skills was new in that deploy, no migration to worry about. Rolling back undoes the new content but the cost of being down outweighs the lost work — we ship the fixed version after.

## Next action
Run Railway rollback to previous deploy. Confirm /skills returns 200. Then diagnose locally before re-shipping.

## After the fix
/post-mortem skills-page-500 once root cause is known.
```

That's the shape. Decide before you dive.
