LLM used: Claude

Transcripts from: MacWhisper Pro (T1), Grain (T2), NotebookLM (T3)

Prompt used:

“Act as a master of transcript accuracy and comparison.

Run Python: jiwer.compare to compare the words transcribed in the [#] transcripts uploaded.”

Note: In my own evaluation, I found a few more errors in NotebookLM’s transcription that Claude did not flag, because all three transcripts contained mistakes. Of the three, NotebookLM was farthest from the actual verbatim statements.

Everything below is Claude’s output:

</aside>

Transcript Accuracy Comparison Report

Executive Summary

I analyzed three different transcripts of the same interview recording using word-level accuracy metrics and content similarity analysis. All transcripts achieved high content similarity (85-90%), but each showed distinct characteristics in terms of accuracy, completeness, and formatting.

Transcript Sources

T1: MacWhisper Pro transcription service
T2: Grain transcription service
T3: NotebookLM transcription service

Quantitative Analysis

Word Count Statistics

T1: 839 words (longest)
T2: 699 words (most concise)
T3: 756 words (moderate length)
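
For readers who want to reproduce counts like these, here is a minimal sketch in Python. It assumes words are counted by splitting on whitespace; the report does not say how the counts above were produced, so a different tokenizer (or leaving speaker labels and timestamps in the text) could give different totals. The `word_count` helper and the sample strings are made up for illustration and are not taken from the actual transcripts.

```python
def word_count(text):
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


# Hypothetical stand-ins for the three transcripts.
samples = {
    "T1": "So, um, we started the project in March.",
    "T2": "We started the project in March.",
    "T3": "So we started a project in March.",
}

counts = {name: word_count(text) for name, text in samples.items()}
longest = max(counts, key=counts.get)
shortest = min(counts, key=counts.get)

for name, count in counts.items():
    print(f"{name}: {count} words")
print(f"Longest: {longest}, shortest: {shortest}")
```

Punctuation stays attached to words under this definition, which does not change the count but would matter for any later word-by-word comparison.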

Content Similarity (Jaccard Index)

T1 vs T2: 85.57% similarity
T1 vs T3: 87.66% similarity
T2 vs T3: 90.21% similarity (highest)
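
The Jaccard index is the size of the intersection of two sets divided by the size of their union. The sketch below shows one way pairwise figures like these could be computed, assuming the sets are the unique lowercased words in each transcript with punctuation stripped. That normalization is an assumption; the report does not state how the text was tokenized. The `tokenize` and `jaccard` helpers and the sample strings are hypothetical.

```python
import re
from itertools import combinations


def tokenize(text):
    """Lowercase the text and extract word tokens (assumed normalization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def jaccard(text_a, text_b):
    """Jaccard index over unique words: |A & B| / |A | B|."""
    set_a, set_b = set(tokenize(text_a)), set(tokenize(text_b))
    if not set_a and not set_b:
        return 1.0  # treat two empty texts as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical stand-ins for the three transcripts.
transcripts = {
    "T1": "So, um, we started the project in March.",
    "T2": "We started the project in March.",
    "T3": "So we started a project in March.",
}

for (name_a, text_a), (name_b, text_b) in combinations(transcripts.items(), 2):
    print(f"{name_a} vs {name_b}: {jaccard(text_a, text_b):.2%}")
```

Because this measure works on sets, it ignores word order and repetition. Two transcripts can score high while still differing in where words appear, and an error shared by all transcripts does not lower any pairwise score.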