How a Claude Code skill caught the bias in my PM interviews

May 4, 2026

I do a lot of customer discovery calls. The transcripts pile up faster than I can read them, and I kept hitting the same problem: by the time I sat down to extract a real insight, I realized I might have inadvertently biased the conversation myself.

So I built a Claude Code skill called customer-discovery. Here’s what it does, what it taught me about my own interviewing, and a few things that broke along the way.

Why does interviewing more customers stop helping after a while?

Memory decays fast. Ebbinghaus’s forgetting curve puts you at roughly 50% recall an hour after a conversation, around 30% by the next day, and only a small fraction by the end of the week. A founder doing four 45-minute discovery calls a week has forgotten most of what they heard before they get back to the deck on Monday.

Transcription solves the capture problem. Google Meet’s Gemini notes, Otter, Granola, Fireflies, pick one. The bottleneck just moves. You now have 30+ transcripts of around 15,000 characters each, and “what did CPOs say about competitive intel” is a haystack problem instead of a memory problem.

So the obvious move is to paste them into Claude and ask. That’s where it started getting weird.

What goes wrong when you point an LLM at raw discovery transcripts?

Three things broke for me. The biggest is that the LLM was citing my own framing back to me as if it came from the customer.

Every one of my calls follows the same arc. I open with discovery questions, then around minute 22 I say something like “let me share what I’ve heard from other CPOs” and shift from listening to discussing my own hypothesis. The next 20 minutes are no longer pure discovery. When you ask an LLM to summarize “the pains my interviewees mentioned,” it pulls quotes from both halves. The customer’s polite engagement with my hypothesis shows up as a customer pain.

Token bloat. 46 transcripts at ~15k chars each add up to more than half a million characters. You can’t read them all into the main context, and even with a long-context model you’re shovelling roughly 200k tokens of raw text (at about four characters per token) through every query. The signal-to-noise ratio on a single question is brutal.

No citations. The summaries sound smart. They’re also impossible to audit. “Many CPOs mentioned roadmap pressure from sales.” Which CPOs? Which call? Which sentence? Without a footnote you can’t tell whether the model found a pattern or invented one. The Mom Test warns about three kinds of misleading data: compliments, fluff, and ideas. An ungrounded LLM summary is all three at once.

How do you separate genuine discovery from your own framing?

The fix that mattered most was conceptually simple: cut every transcript in two at the pivot point. Pre-pivot is unbiased (what they said before I primed them with a hypothesis). Post-pivot is feedback (their reactions to my framing). The two halves never get pooled.

[Diagram: a discovery call has two halves separated by a pivot point. Pre-pivot is unbiased: their world, their pains, their workarounds, their words. Post-pivot is biased: your hypothesis, your framing, their polite reactions. The pivot is the quote “let me share what I’ve heard from other CPOs.”]
The pivot moment is when a discovery call stops being discovery. Everything before is data. Everything after is a different kind of data.

The detector matches phrases I actually use to transition. Things like “let me share what I’ve heard from other CPOs,” “the patterns I’m seeing,” “switch to a demo.” It’s tuned to my verbal style, not a generic template, and it lives in discovery/tooling/config.json so anyone forking the skill can paste in their own phrasings.
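Stripped down, the split is deterministic string matching against that config. Here’s a minimal sketch of the idea, not the skill’s literal code; the pivot_phrases key is illustrative, so check the repo for the real schema:

```python
import json
import re
from pathlib import Path

# Sketch of the pivot split. The "pivot_phrases" key is illustrative;
# the real config.json schema may differ.
CONFIG = json.loads(Path("discovery/tooling/config.json").read_text())

def split_at_pivot(transcript: str) -> tuple[str, str]:
    """Cut one call into (unbiased, feedback) at the earliest pivot phrase."""
    earliest = None
    for phrase in CONFIG["pivot_phrases"]:
        match = re.search(re.escape(phrase), transcript, re.IGNORECASE)
        if match and (earliest is None or match.start() < earliest):
            earliest = match.start()
    if earliest is None:
        return transcript, ""  # no pivot detected: the whole call stays unbiased
    return transcript[:earliest], transcript[earliest:]
```

Because the cut is plain string matching rather than an LLM judgment, the same transcript always splits the same way, which is what makes the post-processor deterministic.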

The split is the entire point. Once it lands, every downstream question gets cleaner.

How do you query the corpus without blowing your context window?

You delegate every read to a subagent and only pull the synthesis back. The main agent never sees a full transcript.

Here’s what happens when I type “/customer-discovery what pain points have CPOs mentioned around competitive intel?”:

  1. The skill auto-routes scope from the wording (see the sketch after this list). “Pain” pulls only unbiased. “React” or “feedback” pulls only post-pivot. Otherwise both. Explicit flags override the auto-route when needed.
  2. It extracts 3-5 keyword candidates from the question.
  3. It hands the keywords to a general-purpose subagent with a tight brief: grep the relevant folder for matches, read the top 5-8 transcripts in full, return a synthesis with cited quotes.
  4. The subagent’s answer comes back as a few paragraphs where every quote is wrapped in a markdown link to the source slug. The raw transcripts never enter my main context.
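Steps 1 and 2 are cheap heuristics, not model calls. Here’s a minimal sketch with made-up trigger lists and stopwords, not the skill’s actual rules:

```python
import re

# Hypothetical trigger lists and stopwords; tune these to your own phrasing.
FEEDBACK_TRIGGERS = ("react", "feedback")
UNBIASED_TRIGGERS = ("pain", "workaround", "today", "currently")
STOPWORDS = {"what", "have", "about", "around", "mentioned", "points"}

def route_scope(question: str) -> str:
    """Pick which half of the corpus a question should read."""
    q = question.lower()
    if any(t in q for t in FEEDBACK_TRIGGERS):
        return "feedback"   # post-pivot reactions only
    if any(t in q for t in UNBIASED_TRIGGERS):
        return "unbiased"   # pre-pivot discovery only
    return "both"

def keyword_candidates(question: str, k: int = 5) -> list[str]:
    """Extract a handful of grep-able keywords from the question."""
    words = re.findall(r"[a-z]+", question.lower())
    return [w for w in words if len(w) > 3 and w not in STOPWORDS][:k]
```

For the query above, route_scope lands on “unbiased” (the question contains “pain”), and keyword_candidates yields something like ["pain", "cpos", "competitive", "intel"].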

Token math. A full unbiased transcript is around 5k tokens. Reading 8 in main context would burn 40k tokens of raw text per question, and the answer doesn’t need them after the read. Delegating keeps the main thread clean and lets the corpus grow without touching the architecture.
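The back-of-envelope version, with the synthesis size as an assumption:

```python
TOKENS_PER_TRANSCRIPT = 5_000   # one unbiased half, roughly
READS_PER_QUERY = 8             # transcripts the subagent reads in full

inline_cost = TOKENS_PER_TRANSCRIPT * READS_PER_QUERY  # 40,000 tokens in main context
delegated_cost = 1_000          # assumption: a few cited paragraphs of synthesis
```

The subagent still pays the 40k read, but in a disposable context that vanishes once the synthesis comes back.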

I learned this the hard way. The first version pulled transcripts inline and my window filled up after three queries. Subagent-only is the only shape that scales.

What does this mean for a PM doing discovery in 2026?

Three habits shifted for me once the tooling existed. The same idea drives all three: the pre-pivot half of every call is the precious data, and everything in my workflow protects it.

  1. Hold my own thinking until the end. I tell every interviewee in the first 3 minutes that I want to learn from their world first, and that I’ll share what I’ve been hearing from other product leaders only at the very end if there’s time. If I break my own rule and shift earlier at minute 12, I lose 30 minutes of unbiased data and the skill literally throws that half away. The cost is visible.
  2. The Mom Test became a post-call audit, run by Claude. I can hand a fresh transcript to Claude and ask it to grade me against the Mom Test: where did I ask hypotheticals, where did I accept fluff, where did I forget the cost-of-inaction question, what’s the one follow-up I should have asked (a sketch of that prompt follows this list). The next call gets sharper. Across 46 transcripts, the patterns in my blind spots are obvious in a way no individual call could surface.
  3. Citations are non-negotiable. Any summary tool that can’t tell me which call a quote came from is doing the same thing my brain does after a week: confidently inventing patterns. If you build something similar, force every quote to carry a link back to the source slug. That one rule does most of the work for trust.
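The habit-2 audit is just one prompt. A minimal sketch using the Anthropic Python SDK; the prompt wording, transcript path, and model name are all placeholders for whatever you actually run:

```python
from pathlib import Path
import anthropic

# Placeholder prompt: rephrase the four audit questions however you like.
AUDIT_PROMPT = """Grade this discovery call against the Mom Test:
1. Where did I ask hypotheticals instead of asking about past behavior?
2. Where did I accept fluff or compliments as signal?
3. Did I ask the cost-of-inaction question?
4. What is the one follow-up I should have asked?

Transcript:
{transcript}"""

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
transcript = Path("discovery/transcripts/latest.md").read_text()  # hypothetical path

reply = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder: use whichever model you run
    max_tokens=1024,
    messages=[{"role": "user", "content": AUDIT_PROMPT.format(transcript=transcript)}],
)
print(reply.content[0].text)
```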

The skill is open-sourced at github.com/hacky-code/customer-discovery. It’s built for Claude Code, but the structure (pivot split, scope routing, subagent-only queries) translates anywhere you can write a deterministic post-processor and a delegated query runner.

The transcripts were fine. I was the one treating two conversations as one.

FAQ

Why are raw customer-discovery transcripts misleading?

Most discovery calls have two halves: the part where you listen, and the part where you share your own hypothesis. The second half is shaped by your framing, but most LLM analysis treats both halves as equal evidence. You end up reading your own hypothesis back as if it were customer data.

What is the Mom Test in customer discovery?

Rob Fitzpatrick's three rules for unbiased customer interviews: talk about their life and not your idea, ask about specific past behavior and not hypothetical futures, and listen more than you talk. The book is the canonical reference for diagnosing interview bias.

How do you analyze customer interview transcripts with Claude Code?

Pull the transcripts into a folder Claude Code can read, then run a skill that fetches new ones from Drive, splits each call at the moment you stop asking discovery questions and start sharing your own hypothesis, and answers questions with cited quotes from specific calls. Keeping full transcripts out of the main context window matters as the corpus grows.