Scorecard Best Practices
Last updated: June 18, 2026
Why this guide exists: A scorecard is a prompt. When you write a scorecard, you are instructing an AI judge on exactly what "good" looks like. The AI is excellent at following clear instructions and bad at guessing your intent. Most scoring problems — "too harsh", "too lenient", "inconsistent" — are not AI bugs. They are ambiguity in the scorecard. This guide shows you how to remove that ambiguity.
How Solidroad reads your scorecard
Before writing criteria, it helps to know what happens under the hood. Understanding this makes every best practice below obvious.
A scorecard is made of sections (each section is one thing you want to measure). For every conversation, the AI reads the full transcript plus your scorecard, then scores each section, writes an explanation and a quoted example, and rolls everything up into a single 0–100% score.
There are three section types:
| Section type | Use it for | How it scores |
|---|---|---|
| Graded (spectrum) | Soft skills: empathy, tone, rapport, clarity | A number on a short scale (use 0–5, not 0–10), based on poor / average / strong descriptions you write |
| Pass / fail (binary) | Objective steps: verifying an account, giving correct info, following a required step | The step either happened correctly (full marks) or it didn't (0) |
| Process-linked | Multi-step SOPs and procedures | Scored against a process document you attach; can be auto-detected at runtime |
Sections also support three advanced controls:
Exclusion / N/A criteria — if a condition is met, the section is marked N/A and removed from the score entirely (e.g. "mark N/A if the customer never asked about billing"). This prevents reps being penalised for steps that were never relevant.
Section fail criteria — forces that one section to 0.
Scorecard fail criteria — forces the entire scorecard to fail regardless of other scores (e.g. a compliance breach).
The score is points earned ÷ points possible, expressed as a percentage. Sections marked N/A or excluded are dropped from both sides of that calculation, so they never drag a score down.
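To make the arithmetic concrete, here is a minimal Python sketch of the roll-up described above. The names (`SectionResult`, `ScorecardResult`, `roll_up`) and the example point values are illustrative assumptions, not Solidroad's actual implementation. This guide does not say how a scorecard-level fail is reported alongside the percentage, so the sketch simply carries it as a separate flag. It also assumes a failed section keeps its points in the denominator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SectionResult:
    name: str
    earned: float                  # points the judge awarded
    possible: float                # maximum points for the section
    not_applicable: bool = False   # exclusion / N/A criteria were met
    section_failed: bool = False   # section fail criteria were met


@dataclass
class ScorecardResult:
    percentage: Optional[float]    # None when every section was N/A
    failed: bool                   # scorecard fail criteria were met


def roll_up(sections: list[SectionResult], scorecard_failed: bool = False) -> ScorecardResult:
    # N/A sections leave both the numerator and the denominator.
    counted = [s for s in sections if not s.not_applicable]
    possible = sum(s.possible for s in counted)
    # Assumption: a section fail scores 0 but its points still count as possible.
    earned = sum(0.0 if s.section_failed else s.earned for s in counted)
    percentage = round(100 * earned / possible, 1) if possible else None
    return ScorecardResult(percentage=percentage, failed=scorecard_failed)


example = [
    SectionResult("Empathy", earned=4, possible=5),
    SectionResult("Identity Verification", earned=10, possible=10),
    SectionResult("Billing", earned=0, possible=10, not_applicable=True),
    SectionResult("De-escalation", earned=3, possible=5, section_failed=True),
]

result = roll_up(example)
print(result.percentage)  # 70.0  (14 of 20 points; Billing is excluded)
```

Had the Billing section counted as a zero instead of N/A, the same conversation would have scored 14 of 30 points, which is why exclusion criteria matter so much for fairness.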
The golden rule: write for the AI, not for a human reviewer
A human reviewer can read "Rep was professional" and apply judgement. The AI needs the judgement written down. Every rule in this guide is a variation of one idea:
Replace anything subjective or implied with a specific, observable behaviour the AI can find in the transcript.
The ten rules for writing scoring criteria
Rule 1 — One behaviour per line
Each scoring item should test exactly one thing, so it can be scored independently and the feedback can be precise.
🚫 "Rep confirms the customer's details, explains the product, and offers a discount."
✅ "Rep confirms the customer's details." / "Rep explains the product." / "Rep offers a discount."
Rule 2 — Use statements, not questions
Write what the rep does, in a complete sentence.
🚫 "Did the rep confirm the problem?"
✅ "Rep confirms the customer's problem before proposing a fix."
Rule 3 — Avoid subjective words; describe the behaviour instead
Words like professional, helpful, clear, good mean nothing to the AI unless you define them.
🚫 "Rep was professional."
✅ "Rep uses a polite greeting, avoids slang, and maintains a respectful tone throughout."
Rule 4 — Match your level of detail to your scoring intent
The more specific your wording, the more literally the AI checks the transcript against it.
Generic (catch-all): "Rep explains the warranty terms."
Specific (exact-match): "Rep states the warranty covers parts and labour for 12 months."
Use generic phrasing when any reasonable version counts; use specific phrasing when the exact content matters.
Rule 5 — Use If / Then for things that may not happen
Many behaviours are only required if something occurs. Spell out the trigger and the required response — this is also how you stop the AI penalising reps for situations that never came up.
"If the customer expresses frustration, then the rep acknowledges the emotion before proceeding."
"If the customer mentions a competitor, then the rep follows the competitive-handling script."
Rule 6 — Use AND / OR / AND-OR to combine requirements
AND (every step is required): "Rep shares the Help Centre" AND "Rep logs a support ticket" AND "Rep confirms the ticket reference."
OR (alternative paths; the rep needs to complete just one): "Rep verifies via email" OR "Rep verifies via account ID and phone number."
AND/OR (any combination; one or more is fine): "Rep provides a product guide" AND/OR "shares a tutorial video" AND/OR "schedules a follow-up."
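Rules 5 and 6 together describe a small piece of logic: a trigger that decides whether an item applies at all, and combinators that decide whether it was met. The AI judge reads your prose rather than running code, so the Python sketch below is only a way to sanity-check the logic you are asking it to apply. All names and the example observations are hypothetical. The sketch treats both OR and AND/OR as "at least one is present", which is an assumption.

```python
from typing import Optional


def all_of(*checks: bool) -> bool:
    """AND: every listed behaviour must be present."""
    return all(checks)


def any_of(*checks: bool) -> bool:
    """OR and AND/OR: at least one listed behaviour must be present."""
    return any(checks)


def conditional(trigger: bool, requirement_met: bool) -> Optional[bool]:
    """If/Then: None means the trigger never occurred, so the item is N/A."""
    return requirement_met if trigger else None


# Hypothetical observations pulled from one transcript.
verified_by_email = False
gave_account_id = True
gave_phone_number = True
customer_frustrated = False
acknowledged_emotion = False

# "Rep verifies via email" OR "Rep verifies via account ID and phone number."
verification = any_of(verified_by_email, all_of(gave_account_id, gave_phone_number))

# "If the customer expresses frustration, then the rep acknowledges the emotion..."
de_escalation = conditional(customer_frustrated, acknowledged_emotion)

print(verification)   # True
print(de_escalation)  # None
```

If you cannot express an item this way without guessing which operator you meant, the AI will have to guess too, and the wording needs another pass.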
Rule 7 — Anchor graded items to real examples
For any graded (spectrum) section, give the AI a transcript-style example of poor, average, and strong. This is the single biggest fix for "the AI is too harsh" and for "I don't know what a 3 vs a 4 means."
Example — Empathy:
Poor: "That's not my problem."
Average: "I understand that's frustrating."
Strong: "I understand that's frustrating, and I'll take care of it for you right now."
Rule 8 — Include the keywords you expect to see
If you expect certain phrasing, name it — and list acceptable synonyms so correct answers aren't marked wrong.
"Rep says 'shipping date' or 'delivery date' when confirming the timeline."
Rule 9 — Order items in the sequence they'll occur
Arrange items roughly in the order they'd appear in a real interaction (greeting → verification → resolution → close). This helps the AI follow the conversation and reduces context confusion.
Rule 10 — Keep structure consistent
Start each item with a fixed stem like "Rep must…" and put each requirement in its own short paragraph or bullet. Consistent, well-separated criteria parse far more reliably than dense blocks.
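Several of the rules above are mechanical enough to pre-check before a scorecard ever reaches the AI. The following Python sketch is a hypothetical helper, not a Solidroad feature: `lint_criterion` flags likely breaches of Rules 1, 2 and 3 in a single scoring item. The "and" check for Rule 1 is a deliberately crude heuristic and will produce false positives (for example "account ID and phone number"), so treat its output as a prompt to re-read the item, not a verdict.

```python
import re

# Words Rule 3 calls out. Assumption: this is a starting list you would extend.
SUBJECTIVE_WORDS = {"professional", "helpful", "clear", "good"}


def lint_criterion(text: str) -> list[str]:
    """Return human-readable warnings for one scoring item."""
    item = text.strip()
    warnings = []
    # Rule 1 (crude heuristic): a lowercase "and" often joins several behaviours.
    # Uppercase AND is left alone because Rule 6 uses it as a combinator.
    if re.search(r"\band\b", item):
        warnings.append("Rule 1: may combine several behaviours")
    # Rule 2: items should be statements, not questions.
    if item.endswith("?"):
        warnings.append("Rule 2: phrased as a question")
    # Rule 3: subjective words need a behavioural definition.
    found = sorted(set(re.findall(r"[a-z]+", item.lower())) & SUBJECTIVE_WORDS)
    if found:
        warnings.append("Rule 3: undefined subjective word(s): " + ", ".join(found))
    return warnings


for item in [
    "Rep confirms the customer's details, explains the product, and offers a discount.",
    "Did the rep confirm the problem?",
    "Rep was professional.",
    "Rep confirms the customer's problem before proposing a fix.",
]:
    print(item, "->", lint_criterion(item))
```

The first three items each trigger one warning; the last one, taken from Rule 2's good example, passes cleanly.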
Choosing graded vs pass/fail (and what scale to grade on)
A balanced scorecard usually mixes both. Use this as your default:
| If the behaviour is… | Use | Because |
|---|---|---|
| A soft skill (empathy, tone, rapport, clarity, active listening) | Graded | You want how well, not just whether; nuance matters |
| An objective step (verify identity, give correct policy, follow a required action) | Pass / fail | It either happened correctly or it didn't; there is no middle ground |
| A compliance must-do (disclosures, authentication, never discussing X) | Pass / fail + scorecard-fail criteria | One miss should fail the whole interaction |
| A documented multi-step procedure | Process-linked | The AI scores against the actual SOP rather than you re-typing every step |
What scale should graded sections use? Default to 0–5, not 0–10. The width of the scale changes how reliably the AI scores: on a wide scale it can't meaningfully separate a 7 from an 8, so fine-grained scores drift between runs and bunch up in the middle (lots of 6s and 7s, almost no 1s or 10s) — the same "too harsh / inconsistent" pattern customers report. This matches the wider evidence on AI judges: controlled studies find a 0–5 scale gives the strongest agreement with human reviewers, while 0–10 is consistently the weakest, and practitioner guidance favours the coarsest scale that still captures the nuance you need.
Default graded sections to 0–5. Reliable enough to trust, granular enough to coach against. Keep graded scoring for genuine spectrums — empathy, tone, rapport, clarity — where how well matters, not just whether.
For anything objective, prefer pass/fail. Binary is the most consistent call an AI can make: there's no "is this a 3 or a 4?" to drift on, and the verdict is directly actionable. Where a behaviour is genuinely multi-part, split it into several pass/fail checks rather than widening the scale.
The trade-off to weigh: pass/fail gives no partial credit and can read as blunt on nuanced behaviour, so calibrate binary criteria carefully (see "Calibrate before you go live") and keep genuine spectrums on 0–5 rather than forcing everything to yes/no.
Structuring the whole scorecard
Group into clear sections, one topic each. Keep distinct skills in separate sections — e.g. active listening and open-ended questions should not share a section, or the AI blends them and the feedback gets muddy. Customers who score well typically run focused single-topic sections rather than one giant catch-all.
Don't over-stuff. More sections give more granular feedback, but every section you add is one more thing to get right and to calibrate. Start lean (3–6 sections is a healthy starting point), prove it scores well, then expand. If you need 20 metrics, consider whether some belong on a separate scorecard.
One scorecard per topic, reused widely. A well-built "Budget Qualification" or "Identity Verification" scorecard can be reused across many simulations and queues, with small topic-specific additions, rather than rebuilt each time.
Set weights deliberately. Points are your priorities made visible. Give compliance and core-outcome sections more weight than nice-to-haves.
Use exclusion/N/A criteria generously. They're how you keep scores fair when not every step applies to every conversation.
The five mistakes that cause bad scores
These are the real, recurring causes of "the scoring is wrong" — drawn from live customer setups.
Subjective adjectives with no definition (professional, good, clear). The AI guesses, and guesses vary. → Define the behaviour (Rule 3).
Several actions crammed into one item. The AI can only give one score, so partial completion becomes unpredictable. → One behaviour per line (Rule 1).
Overlapping sections testing the same thing. Causes double-counting and confusing feedback. → Make each section distinct.
Asking the AI to check something it can't see. Referencing systems, screens, CRM fields, or internal actions that aren't in the transcript. Only score what's observable in the conversation (or attach the data/process as a source). → If it's not in the transcript, it can't be scored.
Requiring steps that didn't apply. Penalising a rep for not de-escalating a calm customer. → Use If/Then (Rule 5) and exclusion criteria.
Calibrate before you go live
Never trust a brand-new scorecard on live data. Validate it first.
Use Testing mode. Run the scorecard against a set of real, already-reviewed conversations before setting it live.
Compare AI vs human. Look at where the AI score and your human score disagree. Each disagreement points at an ambiguous criterion.
Edit the scorecard, not your expectations. Tighten the wording, add an anchor example, add an exclusion rule — then re-run.
Aim for roughly 90% agreement or better with your human reviewers before going live. Several customers hold this as their bar.
Keep calibrating monthly once live. Calibration is an ongoing habit, not a one-time setup — and it's best done with more than one reviewer to remove individual bias.
Important: Editing a scorecard does not retroactively change past scores. To apply changes, re-run the evaluation against the updated scorecard. Testing mode lets you validate changes safely before pushing them live.
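The AI-versus-human comparison above can be tallied with a few lines of Python. This is a sketch under stated assumptions: the data is invented, the function name `agreement_by_section` is hypothetical, and "agreement" is taken to mean the AI and the human gave exactly the same score for a section, since this guide does not define how agreement is measured. Breaking the rate down per section shows which criteria are ambiguous, not just whether the overall bar was met.

```python
from collections import defaultdict

# Hypothetical calibration set: (section, AI score, human score) per reviewed conversation.
reviews = [
    ("Empathy", 4, 4),
    ("Empathy", 3, 4),
    ("Empathy", 2, 4),
    ("Empathy", 5, 5),
    ("Identity Verification", 1, 1),
    ("Identity Verification", 0, 0),
    ("Identity Verification", 1, 1),
    ("Identity Verification", 1, 1),
]


def agreement_by_section(rows):
    """Percentage of conversations where AI and human gave the same score."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for section, ai_score, human_score in rows:
        totals[section] += 1
        if ai_score == human_score:
            matches[section] += 1
    return {section: 100 * matches[section] / totals[section] for section in totals}


rates = agreement_by_section(reviews)
# Sections under the 90% bar are the ones whose wording needs tightening.
needs_work = [section for section, pct in rates.items() if pct < 90]

print(rates)       # {'Empathy': 50.0, 'Identity Verification': 100.0}
print(needs_work)  # ['Empathy']
```

In this invented data the pass/fail section agrees every time while the graded section does not, which would point towards adding or sharpening the Empathy anchor examples before re-running.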
Maintaining scorecards over time
Iterate from patterns, not one-offs. If you see the same over- or under-scoring repeatedly, fix the criterion. Don't rewrite a scorecard off a single odd result.
Re-run past evaluations after every scorecard change, so historic and new scores are measured against the same rubric.
Version intentionally. Note what you changed and why, so a score can always be traced to the rubric that produced it.
Prune. Retire sections that never discriminate between good and bad reps — they add noise and calibration cost.
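One rough way to spot pruning candidates is to look at how much a section's scores actually vary across conversations. The Python sketch below is an illustration only: the function name, the `min_spread` threshold, and the score history are all assumptions, and a narrow spread can also mean your reps are uniformly strong, so review flagged sections before retiring them.

```python
def non_discriminating(scores_by_section: dict[str, list[float]],
                       min_spread: float = 1.0) -> list[str]:
    """Sections whose highest and lowest scores differ by less than min_spread."""
    return [
        name
        for name, scores in scores_by_section.items()
        if scores and max(scores) - min(scores) < min_spread
    ]


# Hypothetical score history: one entry per evaluated conversation.
history = {
    "Greeting": [5, 5, 5, 5, 5],
    "Empathy": [1, 3, 5, 4, 2],
}

print(non_discriminating(history))  # ['Greeting']
```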
Quick-start checklist
Before you set a scorecard live, confirm:
Every item tests one behaviour
No subjective words left undefined
Soft skills are graded on 0–5 with poor/average/strong anchor examples
Objective + compliance steps are pass/fail (compliance → scorecard-fail)
If/Then used for any behaviour that's conditional
Exclusion/N/A rules added so irrelevant steps don't penalise reps
Sections are distinct — no two test the same thing
Nothing asks the AI to judge something outside the transcript
Weights reflect real priorities
Validated in Testing mode to ~90%+ agreement with human scores
Copy-paste section templates
Graded soft-skill section (Empathy)
Name: Empathy
Type: Graded (0–5)
Criteria: Rep acknowledges the customer's emotion before moving to a solution.
Anchors:
- Poor (0–1): Ignores or dismisses the emotion. e.g. "That's not my problem."
- Average (2–3): Names the emotion. e.g. "I understand that's frustrating."
- Strong (4–5): Names the emotion AND commits to act. e.g. "I understand that's frustrating, and I'll fix it now."
Pass/fail compliance section (Identity Verification)
Name: Identity Verification
Type: Pass/Fail
Criteria: Rep must verify the customer before discussing account details.
- Rep verifies via registered email
OR
- Rep verifies via account ID AND phone number
Scorecard-fail: If account details are shared before verification, fail the scorecard.
Conditional section (De-escalation)
Name: De-escalation
Type: Graded (0–5)
Exclusion (N/A): Mark N/A if the customer never expresses frustration or anger.
Criteria: If the customer expresses frustration, then the rep acknowledges it, apologises where appropriate, and slows the pace before proposing next steps.