Usability Test Plan - AI Agent Skill

Use this when you have a prototype, design, or live product and need to plan a usability test. Produces a complete test plan: objectives, task scenarios, success metrics, moderator guide, and observer briefing. This is the evaluative counterpart to /interview-plan (which covers generative research).

Related skills: Pairs with /screener-design for participant recruitment. Feeds into /research-synthesize for post-test analysis. For generative interviews, use /interview-plan instead. For heuristic evaluation without users, use /ux-audit.

Process

Step 1: Gather inputs

Ask the user to provide:

What's being tested -- prototype, live product, specific flow, or concept. Include links or screenshots if available. By 2026 the prototype itself is often AI-generated (Figma Make, v0, Lovable), which means the first thing to test is whether the generated output holds up: edge states, error handling, accessibility, and brand fidelity are exactly where generated UI tends to be thin, so call these out as priority watch areas if the build came from a prompt.
Research questions -- what do you need to learn? (e.g., "Can users complete the onboarding flow without assistance?" or "Where do users get confused in checkout?")
User segment -- who should participate? Role, experience level, familiarity with the product.
Known issues -- anything the team already suspects is broken? (So we can validate, not just discover.)
Decisions this informs -- what will you do differently based on the results? (If nothing, don't test.)
Constraints -- moderated vs. unmoderated, remote vs. in-person, timeline, budget, tools available (e.g., Maze, UserTesting, Lookback).

Step 2: Define test objectives and metrics

## Usability Test Plan -- (Product/feature, date)

### Objectives
1. (Primary objective -- the most important thing to learn)
2. (Secondary objective)
3. (Tertiary objective -- nice-to-have)

### Success metrics

| Metric | Target | How measured |
|---|---|---|
| Task completion rate | (e.g., 80%+ complete without help) | Three-way outcome: completed / completed with help / failed |
| Time on task | (e.g., under 3 minutes for core flow) | Stopwatch from task start to completion |
| Error rate | (e.g., fewer than 2 wrong paths per task) | Count of incorrect actions before recovery |
| Satisfaction | (e.g., 4+ on 5-point post-task rating) | Post-task questionnaire |
| Critical issues found | (e.g., 0 blockers in core flow) | Issues that prevent task completion |
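
To make the "How measured" column concrete, here is a minimal Python sketch of how per-task results might be rolled up into the metrics above. The `TaskResult` record, the `summarize` helper, and the sample numbers are illustrative assumptions, not part of any tool named in this skill. Time on task is taken over attempts that reached completion, which is one common convention rather than the only one.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskResult:
    participant: str
    outcome: str      # "completed", "completed_with_help", or "failed"
    seconds: float    # stopwatch time from task start to completion
    wrong_paths: int  # incorrect actions before recovery
    ease_rating: int  # 1-5 post-task rating


def summarize(results):
    """Aggregate one task's results across participants."""
    n = len(results)
    unaided = sum(r.outcome == "completed" for r in results)
    finished = [r for r in results if r.outcome != "failed"]
    return {
        "unaided_completion_rate": unaided / n,
        # Time on task only counts attempts that reached completion.
        "median_seconds": median(r.seconds for r in finished) if finished else None,
        "mean_wrong_paths": sum(r.wrong_paths for r in results) / n,
        "mean_ease_rating": sum(r.ease_rating for r in results) / n,
    }


# Made-up results for one task across five participants.
results = [
    TaskResult("P1", "completed", 95, 0, 5),
    TaskResult("P2", "completed", 140, 1, 4),
    TaskResult("P3", "completed_with_help", 210, 3, 3),
    TaskResult("P4", "completed", 120, 1, 4),
    TaskResult("P5", "completed", 160, 0, 4),
]
summary = summarize(results)
print(summary)
```

With 5-6 participants per round these figures are rough indicators to compare against the targets, not statistically precise estimates.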

Step 3: Write task scenarios

Design 4-7 task scenarios. Each scenario should:

Describe a realistic goal, not a set of instructions ("You want to invite a teammate to your project" not "Click the settings gear, then click Team Members")
Start with context that grounds the participant in a realistic situation
Be completable in 2-5 minutes
Not reveal the expected path

### Task Scenarios

**Task 1: (Task name)**
- **Scenario:** "(Context and goal written as the participant would experience it -- e.g., 'You just signed up for the product and want to set up your first project. Start from the dashboard.')"
- **Success criteria:** (What counts as successful completion)
- **Metrics to capture:** (Completion, time, errors, satisfaction)
- **Watch for:** (Specific behaviors or decision points to observe)

**Task 2: (Task name)**
(Same format)

Task design rules:

Order tasks from simple to complex. Build participant confidence before testing harder flows.
Include at least one task that tests error recovery (e.g., "You accidentally deleted something -- now recover it")
Include one open-ended exploration task if time allows ("Look around and tell me what you think this product does")
Avoid jargon the product uses internally -- use the participant's language

Step 4: Write the moderator guide

### Moderator Guide

**Before the session (5 min)**
- Welcome and thank participant
- Explain the session: "We're testing the product, not you. There are no wrong answers."
- Get consent for recording
- Ask participant to think aloud: "Tell me what you're thinking as you go through each task."
- Confirm their background matches the screener (1-2 quick questions)

**Warm-up (3-5 min)**
- (1-2 questions about their current workflow related to the product area)
- (Goal: establish comfort and get baseline context)

**Task execution (25-40 min)**
- Read each task scenario aloud, then provide it in writing (paste in chat or hand a card)
- After each task:
  - "On a scale of 1-5, how easy was that?" (post-task rating)
  - "What were you expecting to happen when you clicked X?" (if they hesitated)
  - "Was there anything confusing about that?" (open-ended)
- Between tasks: brief pause. Note any spontaneous comments.

**Debrief (5-10 min)**
- "Which task was the hardest? Why?"
- "What would you change about this product?"
- "Is there anything you expected to find that wasn't there?"
- "Any final thoughts?"

**Moderator rules:**
- Do not help unless the participant is completely stuck for 60+ seconds
- If stuck, offer one neutral prompt: "What would you try next?" before giving a hint
- Never explain the interface. If they ask "What does this button do?" respond: "What do you think it does?"
- Note timestamps for key moments (confusion, delight, errors)
- Record non-verbal cues: hesitation, sighing, re-reading, backtracking

Step 5: Write the observer briefing

### Observer Briefing

**Your role:** Watch silently. Take notes. Do not interact with the participant.

**What to capture:**
- Task completion: did they finish? How long? What path did they take?
- Errors: wrong clicks, backtracking, confusion
- Quotes: verbatim things they say while thinking aloud
- Body language: hesitation, frustration, surprise, delight
- Workarounds: unexpected paths they invent

**Note-taking template (one per task per participant):**

| Field | Notes |
|---|---|
| Task # | |
| Completed? | Yes / No / With help |
| Time | |
| Path taken | (sequence of actions) |
| Errors | |
| Key quote | |
| Observations | |
| Severity of issues | Critical / Major / Minor |
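
If observers capture notes digitally, the template above could be mirrored as a small record type so every note has the same fields. This is a sketch under assumptions: the `ObservationNote` class, its field names, and the sample values are hypothetical, and a participant ID is added so notes can be grouped per participant.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Completed(Enum):
    YES = "Yes"
    NO = "No"
    WITH_HELP = "With help"


class Severity(Enum):
    CRITICAL = "Critical"
    MAJOR = "Major"
    MINOR = "Minor"


@dataclass
class ObservationNote:
    participant_id: str
    task_number: int
    completed: Completed
    time_seconds: float
    path_taken: list[str]
    errors: list[str] = field(default_factory=list)
    key_quote: str = ""
    observations: str = ""
    severity: Optional[Severity] = None  # stays empty until someone assigns it

    def to_markdown(self) -> str:
        """Render the note in the same Field / Notes layout as the template."""
        rows = [
            ("Participant ID", self.participant_id),
            ("Task #", str(self.task_number)),
            ("Completed?", self.completed.value),
            ("Time", f"{self.time_seconds:.0f} s"),
            ("Path taken", " > ".join(self.path_taken)),
            ("Errors", "; ".join(self.errors)),
            ("Key quote", self.key_quote),
            ("Observations", self.observations),
            ("Severity of issues", self.severity.value if self.severity else ""),
        ]
        lines = ["| Field | Notes |", "|---|---|"]
        lines += [f"| {name} | {value} |" for name, value in rows]
        return "\n".join(lines)


# Made-up note for illustration only.
note = ObservationNote(
    participant_id="P3",
    task_number=2,
    completed=Completed.WITH_HELP,
    time_seconds=210,
    path_taken=["Dashboard", "Settings", "Team"],
    errors=["Opened Billing instead of Team"],
    key_quote="I expected this under my profile.",
    severity=Severity.MAJOR,
)
table = note.to_markdown()
print(table)
```

Using enums for "Completed?" and "Severity of issues" keeps observers to the allowed values, which makes notes easier to compare across sessions.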

Step 6: Logistics and schedule

### Test logistics

| Item | Detail |
|---|---|
| Format | Moderated / Unmoderated |
| Setting | Remote / In-person |
| Tool | (Zoom, Maze, UserTesting, Lookback, etc.) |
| Participants | (Number -- typically 5-6 per round) |
| Session length | (45-60 min typical) |
| Schedule | (Date range, sessions per day -- max 3-4 to avoid moderator fatigue) |
| Recording | (Video, audio, screen -- confirm consent process) |
| Recruitment | (Use /screener-design for participant screening) |
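
The schedule row is simple arithmetic, sketched below in Python as a quick sanity check. The `plan_schedule` helper is a hypothetical name, and the inputs (6 participants, 3 sessions per day, 60 minutes each) are sample values drawn from the ranges in the table.

```python
import math


def plan_schedule(participants, sessions_per_day, session_minutes):
    """Return (testing days needed, total moderator hours in session)."""
    days = math.ceil(participants / sessions_per_day)
    moderator_hours = participants * session_minutes / 60
    return days, moderator_hours


days, hours = plan_schedule(participants=6, sessions_per_day=3, session_minutes=60)
print(days, hours)  # 2 6.0
```

This counts in-session time only; setup, breaks between sessions, and note clean-up are not included.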

Step 7: Plan the AI-assisted analysis pass

Decide before the first session how recordings and notes become findings. As of 2026, AI synthesis has crossed from gimmick to genuine leverage: tools like Dovetail auto-transcribe and theme-tag sessions, and AI-native repositories like Notably and Marvin auto-cluster repeated themes and draft insight summaries. This roughly halves session-to-insight time and collapses the manual affinity-mapping that used to eat a full day of sticky-note sorting.

Plan it like this:

### Analysis plan

| Step | How | Owner |
|---|---|---|
| Transcription | Auto-transcribe in-tool (Dovetail) or upload recordings | Tool |
| First clustering pass | AI auto-tags themes and groups repeated patterns | Tool |
| Human verification | Researcher reviews every cluster against raw clips | You |
| Severity calls | Human assigns Critical / Major / Minor (AI does not) | You |
| Decision mapping | Tie each verified finding back to the decision it informs | You |

Rules for the AI pass:

AI does the first clustering pass, never the final call. Treat auto-generated themes as a draft to verify, not a verdict to ship.
Every theme the tool surfaces must trace back to a real clip or quote before it makes the report. AI clustering can over-merge distinct problems or invent a pattern from two coincidental phrasings, so spot-check the source for each cluster.
Severity (Critical / Major / Minor) stays human. Whether a snag blocks the core flow is a judgment call, not a tagging task.
Hand the verified clusters to /research-synthesize for the full write-up.
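
The traceability rule above can be enforced mechanically before anything reaches the report. The following Python sketch assumes a hypothetical in-memory shape for themes and evidence (`Theme`, `Evidence`, `human_verified`); it is not the export format of Dovetail, Notably, or Marvin, and the sample data is made up.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    participant: str
    timestamp: str  # position in the recording, e.g. "12:41"
    quote: str


@dataclass
class Theme:
    label: str
    evidence: list[Evidence] = field(default_factory=list)
    human_verified: bool = False  # set only after review against raw clips


def split_reportable(themes):
    """Separate themes ready for the report from those held back."""
    ready, held = [], []
    for theme in themes:
        # A theme ships only if it has source evidence AND a human checked it.
        if theme.evidence and theme.human_verified:
            ready.append(theme)
        else:
            held.append(theme)
    return ready, held


# Made-up AI-generated clusters for illustration.
themes = [
    Theme(
        "Tax modal order is unclear",
        [Evidence("P2", "14:05", "Why can't I type the rate?")],
        human_verified=True,
    ),
    Theme("Users want dark mode"),  # no clip or quote behind it
    Theme(
        "Draft vs. send confusion",
        [Evidence("P4", "31:20", "Did that save or send?")],
        human_verified=False,  # evidence exists but nobody has checked it yet
    ),
]
ready, held = split_reportable(themes)
print([t.label for t in ready])
print([t.label for t in held])
```

Severity is deliberately absent from this check: it stays a human call made on the verified themes.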

Step 8: Review and validate

Ask the user:

Are the task scenarios realistic? Would a real user encounter these goals?
Are the success criteria clear enough that two observers would agree on pass/fail?
Are we testing the right things? (The hardest part of usability testing is testing the right questions, not running the sessions.)
Anything missing from the observer briefing?

Output location

Present the test plan as formatted text in the conversation. The user copies it into their research documentation or shares with the team.

Example Output

Input

What's being tested: Mid-fidelity Figma prototype of a redesigned invoice creation flow in Meridian Financial's B2B payments platform (desktop web). Link to prototype available internally.
Research questions: Can finance managers create and send a custom invoice without assistance? Where do line-item editing and tax configuration cause errors or drop-off?
User segment: Finance managers and AR specialists at mid-market companies (50–500 employees), moderate SaaS familiarity, some prior experience with QuickBooks or FreshBooks
Known issues: Team suspects the "Add Tax Rule" modal is confusing — it requires users to set a jurisdiction before a rate, but the UI doesn't communicate that dependency. Also unsure if "Save as Draft" vs. "Send Now" distinction is clear.
Decisions this informs: Whether to ship the redesigned flow in Q3 or return to iteration. Tax modal may be redesigned or replaced with a simpler inline input.
Constraints: Moderated remote sessions via Zoom + Lookback, 6 participants, sessions in the next 2 weeks, ~60 minutes each

Output

Usability Test Plan — Meridian Financial Invoice Creation Flow, June 2025

Objectives

Primary: Determine whether finance managers can complete the end-to-end invoice creation and send flow without moderator assistance
Secondary: Identify where users encounter errors or confusion in line-item editing and tax rule configuration
Tertiary: Assess whether the Save as Draft vs. Send Now distinction is understood and used as intended

Success Metrics

| Metric | Target | How measured |
|---|---|---|
| Task completion rate | 80%+ complete core flow without help | Three-way outcome: completed / completed with help / failed |
| Time on task (full invoice) | Under 6 minutes | Stopwatch from task start to first send confirmation |
| Error rate on tax configuration | Fewer than 2 wrong actions before correct input | Count of mis-clicks, wrong field entries, modal re-opens |
| Post-task ease rating | 4+ out of 5 for core flow | Single-question post-task scale |
| Critical blockers | 0 in core send flow | Issues that fully prevent task completion |
| Draft vs. Send confusion | Fewer than 3 participants choose the wrong action | Observed behavior at the final action step |

Task Scenarios

Task 1: Orientation

Scenario: "You've just logged into Meridian for the first time this month. Take a minute to look around this screen and tell me what you think you can do here."
Success criteria: Participant can describe the general purpose of the dashboard and locate the invoice area unprompted
Metrics to capture: Time, verbal description accuracy, navigation path
Watch for: Whether "New Invoice" CTA is noticed immediately or buried in their scan

Task 2: Create a Basic Invoice

Scenario: "A client, Hargrove Construction, has asked for an invoice for two services: a $4,200 consulting retainer and a $950 setup fee. Create an invoice for them and add both line items."
Success criteria: Both line items entered with correct amounts; client name applied
Metrics to capture: Completion, time, errors on line-item entry (especially editing an existing row)
Watch for: Whether users try to edit a line item inline or look for a separate edit button; confusion with the "+" icon vs. "Add Line" text link (known duplication in the UI)

Task 3: Apply a Tax Rule

Scenario: "Hargrove Construction is based in Ontario, Canada. You need to apply the applicable HST rate to both line items before sending."
Success criteria: HST (13%) applied to both line items via the tax modal
Metrics to capture: Completion, error count, time in modal, post-task ease rating
Watch for: Whether users attempt to enter the rate before selecting jurisdiction; whether they re-open the modal or abandon; verbal expressions of confusion or expectation mismatch

Task 4: Save for Later

Scenario: "Your manager wants to review this invoice before it goes out. Save it so you can come back and send it after they've approved it."
Success criteria: Invoice saved as Draft (not sent)
Metrics to capture: Completion, whether Send Now is chosen accidentally, any hesitation at the action buttons
Watch for: Eye movement or hovering between "Send Now" and "Save as Draft"; participants who save correctly but express uncertainty about whether it actually saved

Task 5: Send the Final Invoice

Scenario: "Your manager has approved the invoice. Now send it to Hargrove Construction's billing contact, billing@hargroveconstruction.com."
Success criteria: Invoice sent to correct email address; confirmation state reached
Metrics to capture: Completion, time, any errors adding the recipient email
Watch for: Whether users navigate back to the draft naturally or need to search for it; confusion at the recipient entry field if it doesn't pre-populate the client contact

Task 6: Error Recovery

Scenario: "You just realized you sent an invoice with the wrong setup fee — it should have been $1,150, not $950. What would you do?"
Success criteria: Participant attempts to locate the sent invoice and find an edit or void option (success defined as reaching the correct screen, even if edit is not available in prototype)
Metrics to capture: Path taken, time, verbal problem-solving
Watch for: Whether users look for an "Edit" button on the sent invoice, try to duplicate and re-send, or express that they expect this capability to exist but can't find it

Moderator Guide

Before the session (5 min)

Welcome and thank the participant; introduce yourself and any silent observers
"We're testing the product today, not your skills — there are no wrong answers. If something is confusing, that's valuable information for us."
Confirm recording consent; explain Lookback will capture screen and audio
"As you go through each task, please think out loud — tell me what you're looking at, what you're expecting, and what's going through your mind."
Quick screener confirmation: "Just to confirm, do you currently handle invoicing or accounts receivable in your role?"

Warm-up (3–5 min)

"Walk me through how you typically create an invoice today — what tool do you use, and how long does it usually take?"
"What's the most frustrating part of that process?"
(Goal: calibrate language, establish comfort, surface mental models before they see the prototype)

Task execution (30–40 min)

Read each scenario aloud, then paste the text into the Zoom chat
After each task:
- "On a scale of 1–5, how easy was that?" → note number and any unprompted explanation
- If hesitation observed: "I noticed you paused there — what were you thinking?"
- If they clicked something unexpected: "What were you expecting to happen when you did that?"
- "Was there anything about that task that surprised you?"

Debrief (8–10 min)

"Which task felt the hardest? What made it difficult?"
"Was there anything you expected to find that wasn't there?"
"If you could change one thing about creating an invoice in this product, what would it be?"
"Any final thoughts before we wrap up?"

Moderator rules:

Do not intervene unless the participant has been fully stuck for 60+ seconds
First neutral prompt: "What would you try next?"
Second prompt (if still stuck): "Is there anywhere else you might look for that?"
Never confirm or correct. If they ask "Is this right?" respond: "What do you think?"
Timestamp every error, long pause (5+ sec), and spontaneous comment
Note non-verbal signals: re-reading instructions, sighing, laughing, leaning toward screen

Observer Briefing

Your role: Silent observation. Take structured notes. Do not speak to, react to, or otherwise interact with the participant during tasks.

What to capture:

Exact path taken (sequence of clicks/actions), not a summary
Verbatim quotes while thinking aloud — especially around the tax modal and send/draft step
Errors: wrong clicks, modal re-opens, backtracking, form re-entry
Moments of hesitation (3+ seconds without action)
Any workarounds or invented paths not anticipated by the design

Note-taking template (one per task per participant):

| Field | Notes |
|---|---|
| Participant ID | |
| Task # | |
