$ cat research_statement.txt

My research focuses on vision and language learning in large-scale multimodal systems, with an emphasis on visual reasoning. I study the behavior of frontier models through post-training methods and evaluation science: benchmark design, large-scale evaluation across the full model lifecycle, and synthetic data generation, all aimed at systematically probing and improving model capabilities, particularly in long-horizon, long-tail, and agentic settings.

Across my projects, a core theme has been multimodal safety and alignment, studied empirically through scalable oversight and applied to steering increasingly autonomous, human-oriented agents.

Most recently, I have developed a keen interest in evaluating and benchmarking advanced LLM capabilities for the sciences.

$ ls -lt ./projects/
01
Deep Research: Quantitative AI Math Benchmark
BENCHMARK | STAY TUNED
Contributing to the development of a formal mathematical reasoning benchmark that evaluates LLMs on open quantitative problems across domains — covering optimization, number theory, combinatorics, physics, and beyond, with the goal of enabling incremental, verifiable improvements.
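
As a sketch of what 'verifiable' can mean in this setting, hypothetical grading logic for a numeric answer; the parsing convention and tolerance are illustrative assumptions, not the benchmark's actual specification.

import math

def grade_numeric_answer(model_output: str, reference: float,
                         rel_tol: float = 1e-6) -> bool:
    """Grade the final numeric line of a model's output against a
    reference value; format and tolerance are assumptions."""
    try:
        predicted = float(model_output.strip().splitlines()[-1])
    except (ValueError, IndexError):
        return False                        # unparseable answers score zero
    return math.isclose(predicted, reference, rel_tol=rel_tol)

# Example: a combinatorics item whose reference answer is C(10, 5) = 252.
assert grade_numeric_answer("...reasoning...\n252", math.comb(10, 5))
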
02
The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal AI Innovation
MODEL 2025
Meta Superintelligence Labs
Llama 4 introduces the first open-weight natively multimodal models built on a mixture-of-experts architecture. The herd spans Scout (17B active parameters, 16 experts, 10M-token context window) and Maverick (17B active parameters, 128 experts), both distilled from Behemoth, a 288B-active-parameter teacher model that surpasses GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Pro on STEM benchmarks. Maverick achieves results competitive with DeepSeek v3 at less than half the active parameters, with a best-in-class performance-to-cost ratio.
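
For intuition about the mixture-of-experts design, a minimal top-k routing layer in PyTorch. Expert count, dimensions, and k are toy values, and Llama 4's production routing differs in its details; this is a sketch of the general technique, not the released architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative sizes only)."""

    def __init__(self, dim: int = 64, num_experts: int = 16, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim); each token is routed to its top-k experts.
        gate_logits = self.router(x)
        weights, idx = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)        # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(8, 64)).shape)                # torch.Size([8, 64])

Only k of the experts run per token, which is how a model with a large total parameter count keeps its active parameter count (the 17B figures above) small.
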
03
Polyvore Outfits Dataset
BENCHMARK
Mariya I. Vasileva et al.
A large-scale fashion dataset curated by real users on the Polyvore platform. Each outfit is a manually constructed combination of items that reflects authentic human styling choices, capturing implicit knowledge of color harmony, stylistic coherence, and garment compatibility. Widely adopted across computer vision, recommendation systems, and multimodal learning research.
04
HandsOff: Labeled Dataset Generation with No Additional Human Annotations
HIGHLIGHT
CVPR 2023
Austin Xu, Mariya I. Vasileva, Achal Dave, Arjun Seshadri
A framework capable of producing unlimited synthetic images and labels after training on fewer than 50 pre-existing labeled images. Unifies GAN inversion with dataset generation to produce rich pixel-wise labels across faces, cars, full-body poses, and urban driving scenes. Achieves state-of-the-art performance on semantic segmentation, keypoint detection, and depth estimation, and addresses long-tail challenges in model development.
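
A compressed sketch of the two ingredients, GAN inversion and a label head fit on generator features; every module below is a toy stand-in for the real pipeline, which builds on a full-scale GAN such as StyleGAN2.

import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    def __init__(self, z_dim: int = 32, feat_dim: int = 16):
        super().__init__()
        self.backbone = nn.Linear(z_dim, feat_dim)    # "intermediate GAN features"
        self.to_img = nn.Linear(feat_dim, 3 * 8 * 8)  # tiny 8x8 RGB "image"

    def forward(self, z: torch.Tensor):
        feats = torch.relu(self.backbone(z))
        return self.to_img(feats).view(-1, 3, 8, 8), feats

def invert(gen: ToyGenerator, image: torch.Tensor, steps: int = 100) -> torch.Tensor:
    """GAN inversion by latent optimization: find a latent whose
    generated image reconstructs the target labeled image."""
    z = torch.zeros(1, 32, requires_grad=True)
    opt = torch.optim.Adam([z], lr=0.05)
    for _ in range(steps):
        recon, _ = gen(z)
        loss = (recon - image).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()

gen = ToyGenerator()
z_hat = invert(gen, torch.randn(1, 3, 8, 8))   # step 1: invert a labeled image
label_head = nn.Linear(16, 5)                  # step 2: fit head on GAN features
z_new = torch.randn(4, 32)                     # step 3: sample fresh latents...
imgs, feats = gen(z_new)
labels = label_head(feats)                     # ...minting paired images + labels
print(imgs.shape, labels.shape)

The economics come from step 3: once the head is fit on a handful of inverted examples, every fresh latent yields a labeled sample essentially for free.
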
05
Why do These Match? Explaining the Behavior of Image Similarity Models
ECCV 2020
Bryan A. Plummer*, Mariya I. Vasileva*, Vitali Petsiuk, Kate Saenko, David A. Forsyth
Introduces SANE (Salient Attributes for Network Explanation), a method for explaining image similarity models, whose outputs are similarity scores rather than classification labels. SANE pairs saliency maps with attributes that best explain a match between two images, providing information beyond standard saliency alone and improving attribute recognition performance.
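
The saliency half of the idea can be sketched as an occlusion probe on the similarity score itself; the embedding network below is a toy stand-in, and SANE's actual saliency generation and attribute-ranking components are more involved.

import torch
import torch.nn as nn

embed_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 32))  # toy stand-in

def similarity(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.cosine_similarity(embed_net(a), embed_net(b), dim=-1)

def saliency_map(img_a: torch.Tensor, img_b: torch.Tensor,
                 patch: int = 4) -> torch.Tensor:
    """Occlude patches of img_a and record the drop in similarity to img_b;
    large drops mark the regions that explain the match."""
    base = similarity(img_a, img_b).item()
    _, _, h, w = img_a.shape
    heat = torch.zeros(h // patch, w // patch)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = img_a.clone()
            occluded[:, :, i:i + patch, j:j + patch] = 0.0  # gray out one patch
            heat[i // patch, j // patch] = base - similarity(occluded, img_b).item()
    return heat

a, b = torch.randn(1, 3, 16, 16), torch.randn(1, 3, 16, 16)
print(saliency_map(a, b))   # a 4x4 grid of similarity drops over image a
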
06
Learning Type-Aware Embeddings for Fashion Compatibility
ECCV 2018
Mariya I. Vasileva, Bryan A. Plummer, Krishna Dusad, Shreya Rajpal, Ranjitha Kumar, David A. Forsyth
Presents a contrastive vision-language embedding approach that respects item type and jointly learns notions of similarity and compatibility end-to-end. Evaluated on 68,306 outfits from Polyvore, it achieves a 3–5% improvement over the prior state of the art on compatibility prediction and fill-in-the-blank tasks, while supporting a variety of structured queries.
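
A minimal sketch of the type-aware idea, assuming a shared visual embedding gated by a learned mask per type pair; the dimensions, sigmoid gating, and margin are illustrative assumptions rather than the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes: a handful of (type, type) pairs, e.g. (top, shoe).
NUM_TYPE_PAIRS, DIM = 6, 64
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, DIM))
masks = nn.Parameter(torch.ones(NUM_TYPE_PAIRS, DIM))   # learned per-pair gates

def compat_distance(x: torch.Tensor, y: torch.Tensor, pair_id: int) -> torch.Tensor:
    """Distance inside the subspace selected for this type pair."""
    m = torch.sigmoid(masks[pair_id])                    # soft subspace selection
    return F.pairwise_distance(encoder(x) * m, encoder(y) * m)

# Triplet objective: an anchor top should land closer to a compatible shoe
# than to an incompatible one, measured in the (top, shoe) subspace.
anchor, pos, neg = (torch.randn(8, 3, 32, 32) for _ in range(3))
loss = F.relu(compat_distance(anchor, pos, 0)
              - compat_distance(anchor, neg, 0) + 0.2).mean()  # margin assumed
loss.backward()

Measuring distance only inside a type-pair subspace is what lets similarity and compatibility coexist: two shoes can be close as shoes yet score differently against the same top.
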
07
Learning Similarity Conditions Without Explicit Supervision
ICCV 2019
Reuben Tan, Mariya I. Vasileva, Bryan A. Plummer, Kate Saenko, David A. Forsyth
An approach that jointly learns representations for multiple similarity conditions and their contributions as a latent variable, without explicit supervision at test time. Outperforms the state of the art on Polyvore-Outfits, Maryland-Polyvore, and UT-Zappos50k across fill-in-the-blank, compatibility prediction, and triplet tasks, even against strongly supervised methods.
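
A rough sketch of the latent-condition idea: the network keeps one embedding per condition plus a small weight branch that infers, per input, how much each condition should contribute. Condition counts and sizes below are illustrative, not the paper's settings.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes; no condition labels appear anywhere below.
NUM_CONDITIONS, DIM = 5, 32
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 64), nn.ReLU())
condition_heads = nn.Linear(64, NUM_CONDITIONS * DIM)   # one embedding per condition
weight_branch = nn.Linear(64, NUM_CONDITIONS)           # latent condition weights

def embed(x: torch.Tensor) -> torch.Tensor:
    h = backbone(x)
    embs = condition_heads(h).view(-1, NUM_CONDITIONS, DIM)
    w = F.softmax(weight_branch(h), dim=-1)             # inferred, not supervised
    return (w.unsqueeze(-1) * embs).sum(dim=1)          # weighted fusion

# Trained with an ordinary triplet loss; the weight branch discovers which
# conditions matter for which inputs as a side effect of that objective.
a, p, n = (torch.randn(4, 3, 16, 16) for _ in range(3))
loss = F.triplet_margin_loss(embed(a), embed(p), embed(n), margin=0.3)
loss.backward()
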
mariya@research:~$ 
© 2026 Mariya I. Vasileva. All rights reserved.