$ cat research_statement.txt

My research focuses on vision and language learning in large-scale multimodal systems, with an emphasis on visual reasoning. I study the behavior of frontier models through post-training methods and evaluation science: benchmark design, large-scale evaluation across the full model lifecycle, and synthetic data generation, all aimed at systematically probing and improving model capabilities, particularly in long-horizon, long-tail, and agentic settings.

Across my projects, a core theme has been multimodal safety and alignment, studied empirically through scalable oversight and applied to steering increasingly autonomous, human-oriented agents.

Most recently, I have developed a keen interest in evaluating and benchmarking advanced LLM capabilities for the sciences.

$ ls -lt ./projects/
01
Deep Research: Quantitative AI Math Benchmark
BENCHMARK | STAY TUNED
Contributing to the development of a formal mathematical reasoning benchmark that evaluates LLMs on open quantitative problems across domains — covering optimization, number theory, combinatorics, physics, and beyond, with the goal of enabling incremental, verifiable improvements.
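
As a sketch of what 'verifiable' can mean in this setting, hypothetical grading logic for a numeric answer; the parsing convention and tolerance are illustrative assumptions, not the benchmark's actual specification.

import math

def grade_numeric_answer(model_output: str, reference: float,
                         rel_tol: float = 1e-6) -> bool:
    """Grade the final numeric line of a model's output against a
    reference value; format and tolerance are assumptions."""
    try:
        predicted = float(model_output.strip().splitlines()[-1])
    except (ValueError, IndexError):
        return False                        # unparseable answers score zero
    return math.isclose(predicted, reference, rel_tol=rel_tol)

# Example: a combinatorics item whose reference answer is C(10, 5) = 252.
assert grade_numeric_answer("...reasoning...\n252", math.comb(10, 5))
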
02
The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal AI Innovation
MODEL 2025
Meta Superintelligence Labs
Llama 4 introduces the first open-weight natively multimodal models built on a mixture-of-experts architecture. The herd spans Scout (17B active parameters, 16 experts, 10M-token context window) and Maverick (17B active parameters, 128 experts), both distilled from Behemoth, a 288B-active-parameter teacher model that surpasses GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Pro on STEM benchmarks. Maverick achieves results competitive with DeepSeek v3 at less than half the active parameters, with a best-in-class performance-to-cost ratio.
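
For intuition about the mixture-of-experts design, a minimal top-k routing layer in PyTorch. Expert count, dimensions, and k are toy values, and Llama 4's production routing differs in its details; this is a sketch of the general technique, not the released architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative sizes only)."""

    def __init__(self, dim: int = 64, num_experts: int = 16, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim); each token is routed to its top-k experts.
        gate_logits = self.router(x)
        weights, idx = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)        # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(8, 64)).shape)                # torch.Size([8, 64])

Only k of the experts run per token, which is how a model with a large total parameter count keeps its active parameter count (the 17B figures above) small.
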
03
Polyvore Outfits Dataset
BENCHMARK
Mariya I. Vasileva et al.
A large-scale fashion dataset curated by real users on the Polyvore platform. Each outfit is a manually constructed combination of items that reflects authentic human styling choices, capturing implicit knowledge of color harmony, stylistic coherence, and garment compatibility. Widely adopted across computer vision, recommendation systems, and multimodal learning research.
04
HandsOff: Labeled Dataset Generation with No Additional Human Annotations
HIGHLIGHT
CVPR 2023
Austin Xu, Mariya I. Vasileva, Achal Dave, Arjun Seshadri
A framework capable of producing unlimited synthetic images and labels after training on fewer than 50 pre-existing labeled images. Unifies GAN inversion with dataset generation to produce rich pixel-wise labels across faces, cars, full-body poses, and urban driving scenes. Achieves state-of-the-art performance on semantic segmentation, keypoint detection, and depth estimation, and addresses long-tail challenges in model development.
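
A compressed sketch of the two ingredients, GAN inversion and a label head fit on generator features; every module below is a toy stand-in for the real pipeline, which builds on a full-scale GAN such as StyleGAN2.

import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    def __init__(self, z_dim: int = 32, feat_dim: int = 16):
        super().__init__()
        self.backbone = nn.Linear(z_dim, feat_dim)    # "intermediate GAN features"
        self.to_img = nn.Linear(feat_dim, 3 * 8 * 8)  # tiny 8x8 RGB "image"

    def forward(self, z: torch.Tensor):
        feats = torch.relu(self.backbone(z))
        return self.to_img(feats).view(-1, 3, 8, 8), feats

def invert(gen: ToyGenerator, image: torch.Tensor, steps: int = 100) -> torch.Tensor:
    """GAN inversion by latent optimization: find a latent whose
    generated image reconstructs the target labeled image."""
    z = torch.zeros(1, 32, requires_grad=True)
    opt = torch.optim.Adam([z], lr=0.05)
    for _ in range(steps):
        recon, _ = gen(z)
        loss = (recon - image).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()

gen = ToyGenerator()
z_hat = invert(gen, torch.randn(1, 3, 8, 8))   # step 1: invert a labeled image
label_head = nn.Linear(16, 5)                  # step 2: fit head on GAN features
z_new = torch.randn(4, 32)                     # step 3: sample fresh latents...
imgs, feats = gen(z_new)
labels = label_head(feats)                     # ...minting paired images + labels
print(imgs.shape, labels.shape)

The economics come from step 3: once the head is fit on a handful of inverted examples, every fresh latent yields a labeled sample essentially for free.
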
05
Why do These Match? Explaining the Behavior of Image Similarity Models
ECCV 2020
Bryan A. Plummer*, Mariya I. Vasileva*, Vitali Petsiuk, Kate Saenko, David A. Forsyth
Introduces SANE (Salient Attributes for Network Explanation), a method for explaining image similarity models, whose outputs are similarity scores rather than classification labels. SANE pairs saliency maps with attributes that best explain a match between two images, providing information beyond standard saliency alone and improving attribute recognition performance.
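
The saliency half of the idea can be sketched as an occlusion probe on the similarity score itself; the embedding network below is a toy stand-in, and SANE's actual saliency generation and attribute-ranking components are more involved.

import torch
import torch.nn as nn

embed_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 32))  # toy stand-in

def similarity(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.cosine_similarity(embed_net(a), embed_net(b), dim=-1)

def saliency_map(img_a: torch.Tensor, img_b: torch.Tensor,
                 patch: int = 4) -> torch.Tensor:
    """Occlude patches of img_a and record the drop in similarity to img_b;
    large drops mark the regions that explain the match."""
    base = similarity(img_a, img_b).item()
    _, _, h, w = img_a.shape
    heat = torch.zeros(h // patch, w // patch)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = img_a.clone()
            occluded[:, :, i:i + patch, j:j + patch] = 0.0  # gray out one patch
            heat[i // patch, j // patch] = base - similarity(occluded, img_b).item()
    return heat

a, b = torch.randn(1, 3, 16, 16), torch.randn(1, 3, 16, 16)
print(saliency_map(a, b))   # a 4x4 grid of similarity drops over image a
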
06
Learning Type-Aware Embeddings for Fashion Compatibility
ECCV 2018
Mariya I. Vasileva, Bryan A. Plummer, Krishna Dusad, Shreya Rajpal, Ranjitha Kumar, David A. Forsyth
Presents a contrastive vision-language embedding approach that respects item type and jointly learns notions of similarity and compatibility end-to-end. Evaluated on 68,306 outfits from Polyvore, it achieves a 3–5% improvement over the prior state of the art on compatibility prediction and fill-in-the-blank tasks, while supporting a variety of structured queries.
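
A minimal sketch of the type-aware idea, assuming a shared visual embedding gated by a learned mask per type pair; the dimensions, sigmoid gating, and margin are illustrative assumptions rather than the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes: a handful of (type, type) pairs, e.g. (top, shoe).
NUM_TYPE_PAIRS, DIM = 6, 64
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, DIM))
masks = nn.Parameter(torch.ones(NUM_TYPE_PAIRS, DIM))   # learned per-pair gates

def compat_distance(x: torch.Tensor, y: torch.Tensor, pair_id: int) -> torch.Tensor:
    """Distance inside the subspace selected for this type pair."""
    m = torch.sigmoid(masks[pair_id])                    # soft subspace selection
    return F.pairwise_distance(encoder(x) * m, encoder(y) * m)

# Triplet objective: an anchor top should land closer to a compatible shoe
# than to an incompatible one, measured in the (top, shoe) subspace.
anchor, pos, neg = (torch.randn(8, 3, 32, 32) for _ in range(3))
loss = F.relu(compat_distance(anchor, pos, 0)
              - compat_distance(anchor, neg, 0) + 0.2).mean()  # margin assumed
loss.backward()

Measuring distance only inside a type-pair subspace is what lets similarity and compatibility coexist: two shoes can be close as shoes yet score differently against the same top.
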
07
Learning Similarity Conditions Without Explicit Supervision
ICCV 2019
Reuben Tan, Mariya I. Vasileva, Bryan A. Plummer, Kate Saenko, David A. Forsyth
An approach that jointly learns representations for multiple similarity conditions and their contributions as a latent variable, without explicit supervision at test time. Outperforms the state of the art on Polyvore-Outfits, Maryland-Polyvore, and UT-Zappos50k across fill-in-the-blank, compatibility prediction, and triplet tasks, even against strongly supervised methods.
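
A rough sketch of the latent-condition idea: the network keeps one embedding per condition plus a small weight branch that infers, per input, how much each condition should contribute. Condition counts and sizes below are illustrative, not the paper's settings.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes; no condition labels appear anywhere below.
NUM_CONDITIONS, DIM = 5, 32
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 64), nn.ReLU())
condition_heads = nn.Linear(64, NUM_CONDITIONS * DIM)   # one embedding per condition
weight_branch = nn.Linear(64, NUM_CONDITIONS)           # latent condition weights

def embed(x: torch.Tensor) -> torch.Tensor:
    h = backbone(x)
    embs = condition_heads(h).view(-1, NUM_CONDITIONS, DIM)
    w = F.softmax(weight_branch(h), dim=-1)             # inferred, not supervised
    return (w.unsqueeze(-1) * embs).sum(dim=1)          # weighted fusion

# Trained with an ordinary triplet loss; the weight branch discovers which
# conditions matter for which inputs as a side effect of that objective.
a, p, n = (torch.randn(4, 3, 16, 16) for _ in range(3))
loss = F.triplet_margin_loss(embed(a), embed(p), embed(n), margin=0.3)
loss.backward()
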
mariya@research:~$ 
© 2026 Mariya I. Vasileva. All rights reserved.