Evidence Engine for Product Decisions
Nov 2024
The Business Problem
Product teams waste billions building features nobody uses. Industry data shows only 6.4% of features drive 80% of usage (Pendo, 2024). Companies collectively spend $29.5 billion annually on features customers rarely touch.
The root cause isn't lack of data. It's systematic bias in how product managers interpret that data. Teams use frameworks like RICE (Reach, Impact, Confidence, Effort) that look rigorous but actually amplify bias rather than reduce it.
- HiPPO bias: Executive opinions override user research
- Confirmation bias: Teams seek evidence supporting predetermined decisions
- Anchoring bias: Initial estimates distort all subsequent judgments
The result: Only 8% of high-confidence product bets succeed. Frameworks become post-hoc rationalization tools instead of decision aids.
The Research: Validating the Problem
I conducted user research with 4 product managers and stakeholders to understand how prioritization actually happens in practice.
Research Methodology
- Semi-structured interviews with PMs at startups, growth companies, and financial services
- Survey on framework usage and bias recognition
- Analysis of 50+ academic papers on cognitive bias
- Review of industry data from Pendo, Microsoft, Amazon
All four participants confirmed experiencing bias in prioritization. Three explicitly described frameworks as abandoned or used only for stakeholder communication, not actual decision-making.
Key Findings
- "Frameworks break down due to stakeholder requirements" - Wesley, B2C PM
- "I adjust scores to make things work politically" - Adnan, Startup Founder and PM with 10+ years experience
- "Data is the biggest challenge. You largely go with gut/intuition" - Wesley
- "Relationships determine outcomes" - Jorge, Stakeholder at PC Financial
The research revealed two distinct problems: PMs need better evidence gathering for their own decisions, and they need better ways to defend those decisions to stakeholders. Current frameworks solve neither.
The Solution: Evidence Engine
I built an AI tool that transforms unstructured research into evidence-based decisions. Instead of asking PMs to score features numerically, it guides them through structured evidence gathering and generates transparent reasoning.
How it works:
- Extract: Parses interviews, feedback, and analytics into 6 evidence types (user quotes, behavioral observations, support tickets, analytics, stakeholder input, competitor intel)
- Analyze: Tests hypotheses by actively searching for counter-evidence, not just confirming evidence
- Synthesize: Generates reasoning traces showing exactly how conclusions were reached, with evidence counts and quality assessment
The key difference: traditional frameworks ask "what's your Impact score?" This tool asks "what happens if you don't build this?" and "what evidence supports that claim?"
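As a minimal sketch of the extraction stage's data model (the names here are illustrative, not the tool's actual API), each extracted item is tagged with one of the six evidence types:

```python
from dataclasses import dataclass
from enum import Enum

class EvidenceType(Enum):
    """The six evidence types the extraction stage classifies into."""
    USER_QUOTE = "user_quote"
    BEHAVIORAL_OBSERVATION = "behavioral_observation"
    SUPPORT_TICKET = "support_ticket"
    ANALYTICS = "analytics"
    STAKEHOLDER_INPUT = "stakeholder_input"
    COMPETITOR_INTEL = "competitor_intel"

@dataclass
class Evidence:
    """One extracted piece of evidence with its source and classified type."""
    text: str
    source: str
    etype: EvidenceType
    supports_hypothesis: bool  # set later, during the analyze stage

# Example: a user quote extracted from an interview transcript
e = Evidence(
    text="I never open the export panel; I copy-paste instead.",
    source="interview_2024-11-03.txt",
    etype=EvidenceType.USER_QUOTE,
    supports_hypothesis=False,
)
print(e.etype.value)  # → user_quote
```

Keeping the stance (`supports_hypothesis`) separate from the type is what later lets the analyze stage ask about counter-evidence rather than just tallying mentions.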
Technical Approach
Built with Python, Google Gemini API, and Streamlit. The architecture separates evidence extraction, intent classification, and reasoning generation into modular components.
A three-stage workflow processes unstructured inputs into structured, defensible recommendations with complete reasoning transparency.
Key features:
- Counter-evidence surfacing: Actively searches for data contradicting hypotheses
- Quality over quantity: Prioritizes strong evidence (user interviews, analytics) over weak evidence (stakeholder opinions)
- Transparent reasoning: Shows exactly which evidence led to which conclusions
- Conversational interface: Natural language queries instead of scoring spreadsheets
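The counter-evidence feature above can be sketched as a simple partition over extracted evidence, with an explicit warning when the contradicting side is empty (the record shape and warning text are illustrative assumptions, not the tool's real output):

```python
def surface_counter_evidence(evidence):
    """Partition evidence by stance and flag an empty counter-evidence set.

    `evidence` is a list of dicts with a 'supports' bool and 'text' str,
    a stand-in for the tool's extracted evidence records.
    """
    supporting = [e for e in evidence if e["supports"]]
    contradicting = [e for e in evidence if not e["supports"]]
    if not contradicting:
        # The tool would prompt: "what evidence contradicts this hypothesis?"
        return supporting, contradicting, "No counter-evidence found: search before concluding."
    return supporting, contradicting, None

evidence = [
    {"text": "Users asked for bulk export", "supports": True},
    {"text": "Only 2% of sessions touch export", "supports": False},
]
sup, con, warning = surface_counter_evidence(evidence)
print(len(sup), len(con), warning)  # → 1 1 None
```

The point of the warning path is structural: a hypothesis with zero contradicting items is treated as under-searched, not as confirmed.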
I chose Google Gemini over other LLMs because it offers free API access (making the tool accessible to small teams) while maintaining strong reasoning capabilities for evidence synthesis.
How It Reduces Bias
Structural Constraints, Not Awareness Training
Research shows bias awareness doesn't prevent biased decisions. Anchoring bias persists even after debiasing training (Cohen's d = 1.19 reduces to 0.72, still a substantial effect).
Evidence Engine addresses this through process design:
Each bias type is addressed through specific mechanisms built into the evidence gathering process.
- Confirmation bias: Forced counter-evidence search. System asks "what evidence contradicts this hypothesis?"
- HiPPO bias: Evidence classification shows when stakeholder input dominates user data
- Anchoring bias: Historical calibration compares current estimates to past accuracy
- Recency bias: Temporal weighting shows how much recent vs. historical data drives conclusions
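The HiPPO mechanism in the list above amounts to measuring the evidence mix: if stakeholder input makes up too large a share, the tool can flag it. A minimal sketch (the 50% threshold is an illustrative default, not a value from the tool):

```python
from collections import Counter

def hippo_check(evidence_types, threshold=0.5):
    """Flag when stakeholder input dominates the evidence mix.

    `evidence_types` is a list of evidence-type labels (strings);
    returns the stakeholder share and whether it exceeds the threshold.
    """
    counts = Counter(evidence_types)
    share = counts["stakeholder_input"] / len(evidence_types)
    return share, share > threshold

share, flagged = hippo_check(
    ["stakeholder_input", "stakeholder_input", "user_quote", "analytics"]
)
print(round(share, 2), flagged)  # → 0.5 False
```

The same counting pattern extends to the recency mechanism: replace type labels with timestamps and report how much of the evidence mass falls in a recent window.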
Transparency as Trust Mechanism
User research revealed PMs won't trust a "black box" AI. Both surveyed PMs required transparency and the ability to override conclusions. The tool addresses this by showing complete reasoning traces with evidence counts and allowing PMs to adjust or reject conclusions.
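A reasoning trace of this kind can be rendered from the evidence counts alone. This sketch shows the idea, not the tool's actual output format:

```python
def reasoning_trace(hypothesis, evidence, conclusion):
    """Render a transparent trace: counts per evidence type plus the conclusion.

    `evidence` is a list of dicts with a 'type' key (an illustrative shape).
    """
    counts = {}
    for e in evidence:
        counts[e["type"]] = counts.get(e["type"], 0) + 1
    lines = [f"Hypothesis: {hypothesis}"]
    for etype, n in sorted(counts.items()):
        lines.append(f"  {etype}: {n} item(s)")
    lines.append(f"Conclusion: {conclusion} (PM may override)")
    return "\n".join(lines)

trace = reasoning_trace(
    "Users need bulk export",
    [{"type": "user_quote"}, {"type": "analytics"}, {"type": "user_quote"}],
    "weak support",
)
print(trace)
```

Because the trace is plain text built from the underlying evidence records, a PM can inspect exactly which items drove the conclusion before deciding to accept or override it.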
Business Impact for Financial Services
Evidence Engine addresses challenges specific to banking and financial services product teams:
For Product Strategy:
- Regulatory compliance: Complete evidence trails for feature prioritization decisions support audit requirements
- Risk management: Counter-evidence surfacing identifies potential issues before launch
- Stakeholder alignment: Generated reasoning helps justify decisions to risk, compliance, and executive teams
For Decision Quality:
- Reduce feature waste: Industry data shows 80% of features are rarely used. Structured evidence gathering targets this waste directly
- Improve prediction accuracy: Only 8% of high-confidence bets currently succeed. Reducing bias aims to raise this rate
- Enable learning: Outcome tracking creates feedback loops missing in current processes
For Team Efficiency:
- Fast evidence synthesis: Designed to work in under 15 minutes per feature (research showed speed is critical for adoption)
- Reduce rework: Better initial decisions mean fewer pivots and feature deprecations
- Defensible recommendations: Auto-generated reasoning reduces time spent justifying decisions
What I Learned
From the Research
Frameworks fail because they ask PMs to assign numbers without grounding them in evidence. "What's your Impact score (0.25 to 3)?" invites bias. "What happens if you don't build this?" forces concrete thinking.
PMs don't want automation. They want support. As one PM said: "I want it to be more of a support tool, because I think actually providing the answers, that's like my job." Tools that try to replace judgment will fail.
Storytelling matters as much as scoring. The best data doesn't change decisions if you can't convince stakeholders. Tools must generate persuasive narratives, not just accurate numbers.
From the Implementation
LLMs are good at evidence synthesis but need structured prompts. Separating extraction, analysis, and synthesis into distinct phases with purpose-built prompts produces better results than single-shot generation.
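The phase separation described above can be sketched as three purpose-built prompts chained together. Here `call_llm` is a stub standing in for a real Gemini API call, so the sketch runs without credentials; the prompt templates are illustrative, not the tool's actual prompts:

```python
EXTRACT_PROMPT = "Extract evidence items from these notes: {raw}"
ANALYZE_PROMPT = "List evidence contradicting '{hypothesis}' among: {items}"
SYNTH_PROMPT = "Write a reasoning trace from this analysis: {analysis}"

def call_llm(prompt):
    """Stub LLM call; a real implementation would call the Gemini API here."""
    return f"[model output for: {prompt[:30]}...]"

def run_pipeline(raw_notes, hypothesis):
    """Chain three purpose-built prompts instead of one single-shot prompt."""
    items = call_llm(EXTRACT_PROMPT.format(raw=raw_notes))
    analysis = call_llm(ANALYZE_PROMPT.format(hypothesis=hypothesis, items=items))
    return call_llm(SYNTH_PROMPT.format(analysis=analysis))

out = run_pipeline("interview notes...", "users want bulk export")
print(out.startswith("[model output"))  # → True
```

Each stage sees only the previous stage's output, which keeps every prompt focused on one task and makes intermediate results inspectable, consistent with the transparency requirement above.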
Transparency is non-negotiable. PMs won't trust AI they can't inspect. Showing reasoning traces with evidence counts and quality assessments builds trust more than claiming high accuracy.
Next Steps
- User testing: Validate with 10+ PMs to measure actual usage patterns and adoption barriers
- Outcome tracking: Add post-launch measurement to compare predictions vs. actuals
- Integration: Build Slack/Teams bot and browser extensions to meet PMs in their workflow
- Calibration: Add company-specific historical data to improve prediction accuracy over time