Portfolio
Editorial lead and technical writer specializing in AI/ML content, partnering with researchers and engineers at Scale AI from early framing through publication, on work spanning benchmarks, evaluations, and model behavior research.
Selected Editorial Work:
I'm Afraid I Can't Let You Do That - We replicated Anthropic's Claude 4 blackmail test and found divergent, sometimes more troubling results across frontier models. Co-authored with Jeremy Kritz.
Voice Showdown: The First Arena for Voice AI - Launching the first preference-based arena for voice AI, built on real human speech across 60+ languages. Co-authored with Janie Gu and Advait Gosai.
MoReBench: Evaluating the Process of AI Moral Reasoning - A new benchmark evaluating how models reason through moral dilemmas, not just what they conclude. Co-authored with Brandon Handoko.
Real Speech Breaks AI (And What We're Doing to Fix It) - The first benchmark designed to stress-test conversational robustness in native Speech-to-Speech models.
Crumbling Under Pressure: PropensityBench Reveals AI's Weaknesses - When the safe path fails, will a model take a harmful shortcut? The answer is uncomfortable. Co-authored with Madhu Sehwag.
New Benchmarks Envision the Future of AI in Healthcare - A comparative analysis of three landmark evaluation methodologies from OpenAI, Google DeepMind, and Microsoft, and what they reveal about the real requirements of clinical competence in AI.