Code Review - AI Agent Skill

Use this when you have code changes (a PR, a diff, or modified files) and want a structured review covering test coverage, naming, error handling, security, and alignment with the story's acceptance criteria.

Process

Step 1: Gather inputs

Ask the user to provide:

Code changes -- a diff, PR link (if MCP available), file paths, or pasted code. Include both new and modified files.
Story or acceptance criteria -- the user story this code implements (optional but strongly recommended). Used to check that the code actually delivers what was specified.
Test files -- the tests covering these changes (if separate from the diff).
Team conventions -- any specific coding standards, linting rules, or conventions to check against (optional -- the review will use general best practices if none provided).

Step 1.5: Run and arbitrate the AI reviewers

By 2026 the first pass is automated. Agentic reviewers read the whole source tree, run CodeQL and ESLint, and can open self-healing fix branches: Macroscope's "Fix It For Me", Copilot's tool-calling agent, Greptile (whole-repo context), CodeRabbit (low-noise, roughly 2 false positives per run vs Greptile's ~11). Your job is not to re-run the linter by hand. It is to orchestrate the bots, tune their signal-to-noise, and own the judgment they cannot.

Run the bots first. Let the automated reviewer post its pass before you read a line. Note which class of issue each tool is good at (Greptile for cross-file context, CodeRabbit for precision).
Arbitrate, do not rubber-stamp. Triage every bot comment into: real and must-fix, real but low-priority, or false positive. Dismiss the noise out loud so the author sees the reasoning, not a wall of unanswered bot threads.
Tune the noise. If a tool keeps flagging the same non-issue, that is a config or rules-file gap. Note it so the team can suppress that pattern instead of training reviewers to ignore the bot.
Cover the human-judgment layer. The bots miss intent, scope, architecture fit, and whether the code actually solves the story. That is the rest of this skill. Spend your attention there.

If stacked-PR tooling is in play (Graphite, which lifted Shopify's PRs-per-dev ~33%), review each stack entry as its own small change rather than one giant diff.

Step 2: Review the changes

Evaluate the code across these dimensions:

Acceptance criteria alignment

Does the code implement what the story specifies?
Are all acceptance criteria covered, or are some missing from the implementation?
Does the code do things the story didn't ask for (scope creep)?

Test coverage

Are there tests for the changed code?
Do tests cover happy path, error paths, and edge cases?
Are tests testing behavior (not implementation details)?
Are there missing test cases?
Gherkin traceability: Is there a test for every Gherkin scenario in the story's AC? Any AC without a corresponding test?

Code quality

Naming -- are variables, functions, and classes named clearly? Can you understand the code without comments?
Structure -- is the code organized logically? Are functions doing one thing?
Duplication -- is there copy-pasted logic that should be extracted?
Complexity -- are there deeply nested conditionals, long functions, or hard-to-follow control flow?

Error handling

Are errors handled, not swallowed?
Do error messages help with debugging?
Are edge cases handled (null inputs, empty collections, missing data)?
Are external calls (APIs, database, file system) wrapped with appropriate error handling?

Architecture and system impact

Coupling -- does this change introduce tight coupling between modules or services? Could these components change independently?
Data flow -- does data flow in the expected direction? Are there new cross-service calls that should be async instead of sync?
Observability -- are new code paths instrumented? Are errors logged with sufficient context for debugging in production?
Migration safety -- if this changes a database schema or API contract, is it backward compatible? Is there a rollback path?
Dependency direction -- do dependencies point in the right direction? (Domain logic should not depend on infrastructure details.)

Security

Input validation -- is user input validated before use?
Injection risks -- SQL injection, XSS, command injection?
Authentication/authorization -- are access checks in place?
Data exposure -- are sensitive fields (passwords, tokens, PII) protected?
Dependency concerns -- any new dependencies with known vulnerabilities?
LLM and agent code -- if the change touches a prompt, tool-calling agent, or RAG pipeline, untrusted input can act as instructions. Check for prompt injection exposure (OWASP LLM Top 10 #1), tool-permission scoping (does the agent have more access than the task needs?), system-prompt leakage, and unvalidated retrieval chunks. Run /security-review for the full trust-boundary pass.

Style and consistency

Does the code follow the project's existing patterns?
Are imports organized consistently?
Is formatting consistent with the rest of the codebase?

Step 3: Generate the review

Output a structured review:

Code Review: (PR title or description)

Summary

(One sentence: Approve / Approve with suggestions / Request changes)

Must-fix issues

(Issues that should be resolved before merging)

[file:line] (issue) -- (why it matters) -- (suggested fix)

Suggestions

(Improvements that would make the code better but aren't blockers)

[file:line] (suggestion) -- (rationale)

Acceptance criteria coverage

Criterion	Covered?	Notes
(AC from story)	Yes/No/Partial	(details)

Test coverage assessment

(What's well tested)
(What's missing)

Test quality

Gherkin scenario	Test exists?	Test type	Quality
(AC scenario)	Yes/No	Unit/Integration/E2e	Adequate / Needs improvement / Missing

Related skills: Use /test-coverage-audit for a full AC-to-test mapping across the codebase.

Architecture impact

(Assessment of coupling, data flow, observability additions)
(Any concerns about system-level implications of this change)

What's good

(Specific things done well -- call out clean patterns, good naming, thorough error handling)

Step 4: Discuss

Ask the user:

Any issues you'd like me to explain further?
Want me to write fixes for the must-fix issues?
Want me to write the missing tests?

Step 5: Team-level review standards (for engineering leads)

If the user is setting up or improving code review culture for a team, generate team review standards:

## Code Review Standards -- {{team name}}

### Review SLOs
| Change type | Target turnaround | Escalation if missed |
|------------|------------------|---------------------|
| Bug fix (< 50 lines) | 4 business hours | Author pings reviewer directly |
| Feature (50-300 lines) | 1 business day | Raise in standup |
| Large change (300+ lines) | 2 business days | Split the PR or schedule synchronous review |
| Security-sensitive | 4 business hours | Ping security champion + lead |

### Review depth by change type
| Change type | Expected depth | Focus areas |
|------------|---------------|-------------|
| Bug fix | Targeted -- verify the fix, check for regression risk | Root cause addressed (not just symptom), test coverage for the bug |
| New feature | Thorough -- full review of all dimensions | AC alignment, test coverage, error handling, security |
| Refactor | Structure-focused -- verify behavior preservation | Tests still pass, no behavior changes, improved readability |
| Dependency update | Risk-focused -- check for breaking changes | Changelog review, breaking API changes, security advisories |
| Config/infra change | Impact-focused -- blast radius assessment | What environments affected, rollback path, monitoring |

### Reviewer assignment
- **Round-robin** for general changes -- spread knowledge, avoid bottlenecks
- **Domain expert** for security, performance, or architecture changes
- **Two reviewers** for changes touching shared libraries, APIs, or data models
- **Pair review** (synchronous) for changes > 500 lines or cross-cutting concerns

### Review health metrics
Track quarterly:
- **Review cycle time:** Median time from PR opened to approved (target: < 1 business day)
- **Review throughness:** % of PRs with at least one substantive comment (not just "LGTM")
- **Knowledge distribution:** Number of unique reviewers per codebase area (avoid single-reviewer silos)
- **PR size distribution:** Track % of PRs > 300 lines (goal: shrink over time)

Ask the user:

Does the team have existing review norms this should build on?
Any pain points with current review practices (bottlenecks, rubber-stamping, nitpick wars)?
Want me to draft a team-specific version based on your current workflow?

Output location

Present the review as formatted text in the conversation.

Example Output

Input

Code changes: Diff for a new /api/v1/payments/refund endpoint in a FastAPI service — adds a POST route, a RefundService class, and a Pydantic request model
Story/AC: "As a support agent, I can issue a full or partial refund for a completed order so that customer disputes are resolved without manual DB edits. AC: (1) Given a valid order ID and refund amount ≤ original charge, the refund is processed and a refund.created event is emitted. (2) Given an amount exceeding the original charge, a 422 is returned with a clear error message. (3) Given an already-refunded order, a 409 is returned. (4) All refund attempts are logged with agent ID, order ID, amount, and outcome."
Test files: tests/test_refund_endpoint.py — 3 test functions covering the happy path and the over-amount case
Team conventions: PEP 8, no bare except, all external calls must use the internal http_client wrapper, secrets via settings object only

Output (abbreviated)

Code Review: Add POST /api/v1/payments/refund

Summary

Request changes — two must-fix security and error-handling issues before merging; AC coverage is incomplete, and one test scenario is missing.

Must-fix issues

services/refund_service.py:34 Bare except clause swallows all exceptions from the Stripe API call — violates team convention and hides failures silently. Wrap in except StripeError as e and re-raise as a structured PaymentGatewayError with the original message and request ID attached.
routers/refund.py:18 refund_amount is read directly from the request body and passed to Stripe without validating it is a positive number. A value of 0 or -50.00 will reach the external API. Add a Pydantic @validator enforcing amount > 0.
services/refund_service.py:61 Stripe API key is accessed via os.environ["STRIPE_SECRET_KEY"] — violates the team convention requiring secrets through the settings object. Change to settings.stripe_secret_key.

Suggestions

services/refund_service.py:45–72 process_refund() is doing three things: validating business rules, calling Stripe, and emitting the event. Extract into _validate_refund_eligibility() and _emit_refund_event() — easier to test and reason about.
routers/refund.py:12 agent in the request model would be clearer as agent_id: UUID — matches the naming used everywhere else in the codebase and enforces the type at the boundary.
services/refund_service.py:58 Consider returning the Stripe refund_id in the 200 response body so callers can correlate with Stripe's dashboard without a separate lookup.

Acceptance criteria coverage

Criterion	Covered?	Notes
Valid refund processed + `refund.created` event emitted	Partial	Refund is processed; event emission is implemented but not verified in tests
Amount > original charge → 422 with clear message	Yes	Validated in model and service; error message is descriptive
Already-refunded order → 409	No	No check against existing refund status in `process_refund()`; no test
All attempts logged with agent ID, order ID, amount, outcome	No	Logging call on line 67 omits `agent_id` and doesn't log failure outcomes

Test coverage assessment

Well tested: Happy path (201 response, correct payload shape) and over-amount validation (422) are covered with clear assertions
Missing: No test for the duplicate-refund 409 case (AC #3); no test for Stripe API failure (500 vs. 502 distinction); no test verifying the refund.created event is actually emitted; edge case of amount = 0 not covered

Test quality

Gherkin scenario	Test exists?	Test type	Quality
Valid full/partial refund processed	Yes	Unit (mocked Stripe)	Adequate
Amount exceeds original charge → 422	Yes	Unit	Adequate
Already-refunded order → 409	No	—	Missing
All attempts logged with required fields	No	—	Missing

Related skills: Use /test-coverage-audit for a full AC-to-test mapping across the payments service.

Architecture impact

Coupling: RefundService instantiates its own stripe.Client internally — makes the dependency hard to swap for testing and couples the service layer to the SDK. Inject the client via the constructor or use the existing http_client wrapper per team convention.
Observability: No structured log on the success path includes the Stripe refund_id, meaning production incidents can't be correlated to Stripe events without manual lookup. Add it to the success log line.
Migration safety: No schema changes — low risk. The refund.created event is a new event type; confirm the event bus consumer team is aware before merging to avoid silent drops.

What's good

The Pydantic request model is clean and uses Decimal for amount rather than float — correct choice for currency
409 vs. 422 status code distinction is semantically accurate (conflict vs. validation error)
The router stays thin — business logic is correctly pushed into RefundService, not the route handler
Dependency on settings is used correctly everywhere except the one flagged violation

Want me to:

Write fixes for the three must-fix issues?
Generate the two missing test cases (409 duplicate refund + event emission assertion)?
Explain the event bus concern in more detail?

Run this now

Try /code-review on your own input

0/4000

Part of these Playbook topics

Pairing Test-Driven Development CI/CD

Related Engineering skills

ADR Generate AI Testing Strategy Architecture Context Reviewer Architecture Discovery Boris Model Build vs Buy Codependency Analyzer Debug Assist

Back to Skills Catalog