Use this when you have code changes (a PR, a diff, or modified files) and want a structured review covering test coverage, naming, error handling, security, and alignment with the story's acceptance criteria.
Process
Step 1: Gather inputs
Ask the user to provide:
- Code changes — a diff, PR link (if MCP available), file paths, or pasted code. Include both new and modified files.
- Story or acceptance criteria — the user story this code implements (optional but strongly recommended). Used to check that the code actually delivers what was specified.
- Test files — the tests covering these changes (if separate from the diff).
- Team conventions — any specific coding standards, linting rules, or conventions to check against (optional — the review will use general best practices if none provided).
Step 2: Review the changes
Evaluate the code across these dimensions:
Acceptance criteria alignment
- Does the code implement what the story specifies?
- Are all acceptance criteria covered, or are some missing from the implementation?
- Does the code do things the story didn't ask for (scope creep)?
Test coverage
- Are there tests for the changed code?
- Do tests cover happy path, error paths, and edge cases?
- Are tests testing behavior (not implementation details)?
- Are there missing test cases?
- Gherkin traceability: Is there a test for every Gherkin scenario in the story's AC? Any AC without a corresponding test?
Code quality
- Naming — are variables, functions, and classes named clearly? Can you understand the code without comments?
- Structure — is the code organized logically? Are functions doing one thing?
- Duplication — is there copy-pasted logic that should be extracted?
- Complexity — are there deeply nested conditionals, long functions, or hard-to-follow control flow?
Error handling
- Are errors handled, not swallowed?
- Do error messages help with debugging?
- Are edge cases handled (null inputs, empty collections, missing data)?
- Are external calls (APIs, database, file system) wrapped with appropriate error handling?
Architecture and system impact
- Coupling — does this change introduce tight coupling between modules or services? Could these components change independently?
- Data flow — does data flow in the expected direction? Are there new cross-service calls that should be async instead of sync?
- Observability — are new code paths instrumented? Are errors logged with sufficient context for debugging in production?
- Migration safety — if this changes a database schema or API contract, is it backward compatible? Is there a rollback path?
- Dependency direction — do dependencies point in the right direction? (Domain logic should not depend on infrastructure details.)
Security
- Input validation — is user input validated before use?
- Injection risks — SQL injection, XSS, command injection?
- Authentication/authorization — are access checks in place?
- Data exposure — are sensitive fields (passwords, tokens, PII) protected?
- Dependency concerns — any new dependencies with known vulnerabilities?
Style and consistency
- Does the code follow the project's existing patterns?
- Are imports organized consistently?
- Is formatting consistent with the rest of the codebase?
Step 3: Generate the review
Output a structured review:
Code Review: (PR title or description)
Summary
(One sentence: Approve / Approve with suggestions / Request changes)
Must-fix issues
(Issues that should be resolved before merging)
- [file:line] (issue) — (why it matters) — (suggested fix)
Suggestions
(Improvements that would make the code better but aren't blockers)
- [file:line] (suggestion) — (rationale)
Acceptance criteria coverage
| Criterion | Covered? | Notes |
|---|---|---|
| (AC from story) | Yes/No/Partial | (details) |
Test coverage assessment
- (What's well tested)
- (What's missing)
Test quality
| Gherkin scenario | Test exists? | Test type | Quality |
|---|---|---|---|
| (AC scenario) | Yes/No | Unit/Integration/E2e | Adequate / Needs improvement / Missing |
Related skills: Use
/test-coverage-auditfor a full AC-to-test mapping across the codebase.
Architecture impact
- (Assessment of coupling, data flow, observability additions)
- (Any concerns about system-level implications of this change)
What's good
- (Specific things done well — call out clean patterns, good naming, thorough error handling)
Step 4: Discuss
Ask the user:
- Any issues you'd like me to explain further?
- Want me to write fixes for the must-fix issues?
- Want me to write the missing tests?
Step 5: Team-level review standards (for engineering leads)
If the user is setting up or improving code review culture for a team, generate team review standards:
## Code Review Standards -- {{team name}}
### Review SLOs
| Change type | Target turnaround | Escalation if missed |
|------------|------------------|---------------------|
| Bug fix (< 50 lines) | 4 business hours | Author pings reviewer directly |
| Feature (50-300 lines) | 1 business day | Raise in standup |
| Large change (300+ lines) | 2 business days | Split the PR or schedule synchronous review |
| Security-sensitive | 4 business hours | Ping security champion + lead |
### Review depth by change type
| Change type | Expected depth | Focus areas |
|------------|---------------|-------------|
| Bug fix | Targeted -- verify the fix, check for regression risk | Root cause addressed (not just symptom), test coverage for the bug |
| New feature | Thorough -- full review of all dimensions | AC alignment, test coverage, error handling, security |
| Refactor | Structure-focused -- verify behavior preservation | Tests still pass, no behavior changes, improved readability |
| Dependency update | Risk-focused -- check for breaking changes | Changelog review, breaking API changes, security advisories |
| Config/infra change | Impact-focused -- blast radius assessment | What environments affected, rollback path, monitoring |
### Reviewer assignment
- **Round-robin** for general changes -- spread knowledge, avoid bottlenecks
- **Domain expert** for security, performance, or architecture changes
- **Two reviewers** for changes touching shared libraries, APIs, or data models
- **Pair review** (synchronous) for changes > 500 lines or cross-cutting concerns
### Review health metrics
Track quarterly:
- **Review cycle time:** Median time from PR opened to approved (target: < 1 business day)
- **Review throughness:** % of PRs with at least one substantive comment (not just "LGTM")
- **Knowledge distribution:** Number of unique reviewers per codebase area (avoid single-reviewer silos)
- **PR size distribution:** Track % of PRs > 300 lines (goal: shrink over time)
Ask the user:
- Does the team have existing review norms this should build on?
- Any pain points with current review practices (bottlenecks, rubber-stamping, nitpick wars)?
- Want me to draft a team-specific version based on your current workflow?
Output location
Present the review as formatted text in the conversation.
Example Output
Input
- Code changes: Diff for a new
/api/v1/payments/refundendpoint in a FastAPI service — adds aPOSTroute, aRefundServiceclass, and a Pydantic request model - Story/AC: "As a support agent, I can issue a full or partial refund for a completed order so that customer disputes are resolved without manual DB edits. AC: (1) Given a valid order ID and refund amount ≤ original charge, the refund is processed and a
refund.createdevent is emitted. (2) Given an amount exceeding the original charge, a 422 is returned with a clear error message. (3) Given an already-refunded order, a 409 is returned. (4) All refund attempts are logged with agent ID, order ID, amount, and outcome." - Test files:
tests/test_refund_endpoint.py— 3 test functions covering the happy path and the over-amount case - Team conventions: PEP 8, no bare
except, all external calls must use the internalhttp_clientwrapper, secrets viasettingsobject only
Output (abbreviated)
Code Review: Add POST /api/v1/payments/refund
Summary
Request changes — two must-fix security and error-handling issues before merging; AC coverage is incomplete, and one test scenario is missing.
Must-fix issues
-
services/refund_service.py:34Bareexceptclause swallows all exceptions from the Stripe API call — violates team convention and hides failures silently. Wrap inexcept StripeError as eand re-raise as a structuredPaymentGatewayErrorwith the original message and request ID attached. -
routers/refund.py:18refund_amountis read directly from the request body and passed to Stripe without validating it is a positive number. A value of0or-50.00will reach the external API. Add a Pydantic@validatorenforcingamount > 0. -
services/refund_service.py:61Stripe API key is accessed viaos.environ["STRIPE_SECRET_KEY"]— violates the team convention requiring secrets through thesettingsobject. Change tosettings.stripe_secret_key.
Suggestions
-
services/refund_service.py:45–72process_refund()is doing three things: validating business rules, calling Stripe, and emitting the event. Extract into_validate_refund_eligibility()and_emit_refund_event()— easier to test and reason about. -
routers/refund.py:12agentin the request model would be clearer asagent_id: UUID— matches the naming used everywhere else in the codebase and enforces the type at the boundary. -
services/refund_service.py:58Consider returning the Striperefund_idin the 200 response body so callers can correlate with Stripe's dashboard without a separate lookup.
Acceptance criteria coverage
| Criterion | Covered? | Notes |
|---|---|---|
Valid refund processed + refund.created event emitted | Partial | Refund is processed; event emission is implemented but not verified in tests |
| Amount > original charge → 422 with clear message | Yes | Validated in model and service; error message is descriptive |
| Already-refunded order → 409 | No | No check against existing refund status in process_refund(); no test |
| All attempts logged with agent ID, order ID, amount, outcome | No | Logging call on line 67 omits agent_id and doesn't log failure outcomes |
Test coverage assessment
- Well tested: Happy path (201 response, correct payload shape) and over-amount validation (422) are covered with clear assertions
- Missing: No test for the duplicate-refund 409 case (AC #3); no test for Stripe API failure (500 vs. 502 distinction); no test verifying the
refund.createdevent is actually emitted; edge case ofamount = 0not covered
Test quality
| Gherkin scenario | Test exists? | Test type | Quality |
|---|---|---|---|
| Valid full/partial refund processed | Yes | Unit (mocked Stripe) | Adequate |
| Amount exceeds original charge → 422 | Yes | Unit | Adequate |
| Already-refunded order → 409 | No | — | Missing |
| All attempts logged with required fields | No | — | Missing |
Related skills: Use
/test-coverage-auditfor a full AC-to-test mapping across the payments service.
Architecture impact
- Coupling:
RefundServiceinstantiates its ownstripe.Clientinternally — makes the dependency hard to swap for testing and couples the service layer to the SDK. Inject the client via the constructor or use the existinghttp_clientwrapper per team convention. - Observability: No structured log on the success path includes the Stripe
refund_id, meaning production incidents can't be correlated to Stripe events without manual lookup. Add it to the success log line. - Migration safety: No schema changes — low risk. The
refund.createdevent is a new event type; confirm the event bus consumer team is aware before merging to avoid silent drops.
What's good
- The Pydantic request model is clean and uses
Decimalforamountrather thanfloat— correct choice for currency - 409 vs. 422 status code distinction is semantically accurate (conflict vs. validation error)
- The router stays thin — business logic is correctly pushed into
RefundService, not the route handler - Dependency on
settingsis used correctly everywhere except the one flagged violation
Want me to:
- Write fixes for the three must-fix issues?
- Generate the two missing test cases (409 duplicate refund + event emission assertion)?
- Explain the event bus concern in more detail?