My research focuses on vision-and-language learning in large-scale multimodal systems, with an emphasis on visual reasoning. I study the behavior of frontier models through post-training methods and evaluation science, including benchmark design, large-scale evaluation across the full model lifecycle, and synthetic data generation, to systematically probe and improve model capabilities, particularly in long-horizon, long-tail, and agentic settings.
A core theme across my projects has been multimodal safety and alignment, studied empirically through scalable oversight and applied to steering increasingly autonomous, human-facing agents.
Most recently, I have developed a keen interest in evaluating and benchmarking advanced LLM capabilities for the sciences.