<aside> <img src="/icons/asterisk_yellow.svg" alt="/icons/asterisk_yellow.svg" width="40px" />
The fastest test of transcription capabilities, with 1 transcript to start, and endlessly scalable.
My typical test with this protocol:
1st round: 1 audio file → transcript
2nd round: 3 languages × 3 audio files each × 3 tools tested in one go ≈ 2 hours for the full process
</aside>
| Complete? | Phase | Max min | What you do | Output + documentation |
| --- | --- | --- | --- | --- |
| ✅ | A. Setup | 5 min | • Pick 1 interview recording (can be audio-only) • Create a new Test folder where you’ll document the test files and results | Folder created. Saved original .mp3/.mp4/.wav… |
| ✅ | B. Tool 1 - CONTROL - “Gold Standard” transcript [or simply choose Test Tool #1] | 2 min | • Run the audio file through a trusted tool, or use a transcript you transcribed manually for the exact interview recording • Save as “[File name details - Control]” in your Test folder | Transcript #1: one “gold standard” / control transcript saved in the Test folder |
| | C. Test Tool 2 | 2 min | • Drag the same audio file into Tool #2 • Wait for it to finish processing • Download or copy-paste the complete transcript into a document | Transcript #2 |
| | D. Test Tool 3 | 2 min | • Drag the same audio file into Tool #3 • Wait for it to finish processing • Download or copy-paste the complete transcript into a document | Transcript #3 |
| ✅ | E. Save all transcripts | 5 min | Save all 3 transcripts in the Test folder as “[File name details - Tool X name]” | All 3 transcripts saved in the Test folder |
| ✅ | F. Manual check | 15 min | Human spot-check: • Skim for typical issues: names, numbers, domain jargon, language problems… • What are you particularly interested in? Finding a tool that accurately captures brand names, handles languages/dialects in specific contexts, slang? | Note your thoughts in a document and save it in the Test folder |
| ✅ | G. WER eval in Claude | 15 min | Automated (fast): • Run a quick Word Error Rate (WER) script comparing the gold transcript against each tool’s transcript (see the sketch after this table) | WER error % and standout error examples. Notes about errors Claude caught, etc. Verdict: which tool performed best |
| ✅ | H. Final verdict | 10 min | • Compare your notes with the WER results • Identify the top errors and how much they matter to your work • Consider: how much time would it take to fix these errors in manual cleaning, and how big a deal is it if all your transcripts are like this? | One-pager noting issues and wins per tool, plus a final **go/no-go** verdict for each. Save it in the Test folder and anywhere else your team needs to see it |
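If you’d rather run the comparison locally instead of (or alongside) asking Claude, a minimal WER sketch in Python looks like the one below. It assumes plain-text transcript files with hypothetical names (`gold.txt`, `tool2.txt`, `tool3.txt`) sitting in your Test folder, and uses basic lowercasing plus punctuation stripping before a standard word-level edit-distance count; a real evaluation may want heavier normalization of numbers, fillers, and speaker labels.

```python
# Minimal Word Error Rate (WER) sketch: compares a "gold standard" transcript
# against each tool's transcript. File names are placeholders for whatever
# you saved in your Test folder.

import re


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into words (simple normalization)."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # drop punctuation, keep apostrophes
    return text.split()


def wer(reference: list[str], hypothesis: list[str]) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with a standard Levenshtein (edit-distance) table over words."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # deleting all reference words
    for j in range(cols):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[-1][-1] / max(len(reference), 1)


if __name__ == "__main__":
    # Placeholder file names - point these at your own Test folder files.
    gold = normalize(open("gold.txt", encoding="utf-8").read())
    for name in ["tool2.txt", "tool3.txt"]:
        hyp = normalize(open(name, encoding="utf-8").read())
        print(f"{name}: WER = {wer(gold, hyp):.1%}")
```

Lower is better. Treat the percentage as a starting point, not a verdict: a single misheard brand name or number can matter far more to your work than a pile of small filler-word errors, which is exactly why the manual check in step F sits alongside the WER number.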