This week in AGI Advance, we go inside Turing’s latest efforts in multimodal evaluation, explore research pushing LLM reasoning performance with negative data, and highlight model advances in cross-modal generation and scientific comprehension.

From testing adversarial image prompts to curating long-context science datasets, one thing is clear: training smarter models requires testing them smarter too.

What we’re thinking

At Turing, we’re scaling the frontier of multimodal LLM evaluation—where text, vision, and audio meet in real-world tasks. This week, we’re reflecting on insights from our RLHF + SFT campaign focused on building grounded multimodal training data across image, audio, and safety-critical contexts.

Multimodal tuning meets real-world taxonomies: We worked across a wide range of prompt and image types (e.g., infographics, UI screenshots, object photos, and surreal concepts) to align with domain-specific taxonomies. We built internal tracking infrastructure to ensure distribution targets were met without duplicating any prompt or image.
Safety as a first-class citizen: From adversarial prompt design to harmfulness detection in vision tasks, our teams stress-tested safety boundaries with a large volume of adversarial inputs. One-third of them triggered major issues—surfacing crucial model gaps that traditional evals overlook.
Audio as the next modality frontier: We’re now working on voice samples recorded in both noise-free and interference-rich environments for reasoning tasks, with growing interest in video on the horizon. Turing is sourcing diverse talent across multiple locales to support multilingual, speaker-varied, and acoustically controlled pipelines at scale.
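
For readers curious what the distribution tracking described in the first item might look like, here is a minimal Python sketch. It is not Turing's internal tooling: the `DistributionTracker` class, the category labels, and the quota numbers are all hypothetical, and duplicates are detected only by exact hash (after light normalization of the prompt text), which is an assumption made for illustration.

```python
import hashlib
from collections import Counter
from typing import Dict


class DistributionTracker:
    """Hypothetical tracker: per-category quotas plus exact-duplicate checks."""

    def __init__(self, targets: Dict[str, int]) -> None:
        self.targets = dict(targets)      # category -> number of samples wanted
        self.counts: Counter = Counter()  # category -> number accepted so far
        self.seen_prompts = set()         # hashes of normalized prompt text
        self.seen_images = set()          # hashes of raw image bytes

    @staticmethod
    def _prompt_key(prompt: str) -> str:
        # Lowercase and collapse whitespace so trivial variants count as duplicates.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    @staticmethod
    def _image_key(image_bytes: bytes) -> str:
        # Exact byte match only; near-duplicate images would need perceptual hashing.
        return hashlib.sha256(image_bytes).hexdigest()

    def add(self, category: str, prompt: str, image_bytes: bytes) -> bool:
        """Accept the sample only if its category has room and nothing repeats."""
        if self.counts[category] >= self.targets.get(category, 0):
            return False  # unknown category or quota already met
        prompt_key = self._prompt_key(prompt)
        image_key = self._image_key(image_bytes)
        if prompt_key in self.seen_prompts or image_key in self.seen_images:
            return False  # prompt or image was already used
        self.seen_prompts.add(prompt_key)
        self.seen_images.add(image_key)
        self.counts[category] += 1
        return True

    def remaining(self) -> Dict[str, int]:
        """Samples still needed per category."""
        return {c: t - self.counts[c] for c, t in self.targets.items()}


# Hypothetical usage with made-up categories and placeholder image bytes.
tracker = DistributionTracker({"infographic": 2, "ui_screenshot": 1})
tracker.add("infographic", "What does this chart show?", b"image-1")
print(tracker.remaining())
```

A real pipeline would likely need fuzzier matching (paraphrased prompts, resized images), but the shape is the same: check the quota, check for repeats, then record the sample.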

As AI systems become multimodal, so must our eval infrastructure. The future of training and alignment isn’t just about scale—it’s about coverage, diversity, and trust.

What we’re reading

CURIE: Evaluating LLMs on Multitask Scientific Long-Context Understanding and Reasoning
CURIE is a benchmark for long-context scientific reasoning across domains like quantum computing, biodiversity, and materials science. With 10 complex tasks and examples from real research papers, CURIE challenges models to do more than summarize—it pushes them to extract, infer, and compute. Even top models (Claude 3, Gemini 2.0 Flash) peak around 32% accuracy, highlighting the difficulty of true scientific comprehension.
RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold
CMU and DeepMind researchers show that incorrect, model-generated math traces can, if processed properly, improve LLM math reasoning about as much as scaling up positive-only synthetic data by 8×. The key: apply per-step reinforcement learning to identify and fix "critical steps" in bad outputs. The paper reframes negative data not as noise but as an asset that helps models unlearn spurious correlations and generalize better.
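
To make the idea of a "critical step" concrete, here is a toy Python sketch. It is not the paper's implementation: it assumes a step can be scored by how much the estimated chance of reaching a correct answer changes once that step is added, and `success_rate` is a hypothetical stand-in for whatever estimator (for example, sampled completions from a prefix) a real system would use.

```python
from typing import Callable, List, Sequence


def step_advantages(
    steps: Sequence[str],
    success_rate: Callable[[Sequence[str]], float],
) -> List[float]:
    """Score each step by the change in estimated success after adding it."""
    # values[i] is the estimated success rate given the first i steps.
    values = [success_rate(steps[:i]) for i in range(len(steps) + 1)]
    return [values[i + 1] - values[i] for i in range(len(steps))]


def critical_step(
    steps: Sequence[str],
    success_rate: Callable[[Sequence[str]], float],
) -> int:
    """Index of the step that hurts the estimated success rate the most."""
    advantages = step_advantages(steps, success_rate)
    return min(range(len(advantages)), key=advantages.__getitem__)


# Toy trace with one wrong step; the estimator below is made up for illustration.
trace = ["expand (x + 1)^2", "drop the 2x term", "solve for x"]


def toy_success_rate(prefix: Sequence[str]) -> float:
    # Pretend any prefix containing the wrong step rarely reaches a correct answer.
    return 0.1 if "drop the 2x term" in prefix else 0.8


print(critical_step(trace, toy_success_rate))
```

In this sketch, steps with strongly negative scores are the ones a per-step training signal would push the model away from, while steps with near-zero scores are left mostly alone.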
NExT-GPT: Any-to-Any Multimodal LLM
This paper introduces NExT-GPT, an end-to-end multimodal LLM (MM-LLM) that handles input and output across text, image, audio, and video. It combines lightweight projection tuning with a new dataset for modality-switching instruction tuning (MosIT) to deliver robust reasoning and generation. Unlike pipeline-based approaches, NExT-GPT is a single unified system, and the paper reports state-of-the-art performance on tasks like image captioning, video QA, audio synthesis, and cross-modal dialogue.

Where we’ll be

Turing will be at two major AI conferences in the coming months—join us to discuss the future of AGI:

ICLR 2025 [Singapore | Apr 24 – 28]
A top-tier deep learning conference covering representation learning, AI optimization, and theoretical advancements.
MLSys 2025 [Santa Clara, CA | May 12 – 15]
A major event focused on the intersection of machine learning and systems, discussing efficient AI model training, distributed learning, and AI hardware innovations.

If you’re attending, reach out—we’d love to connect and exchange insights!

Stay ahead with AGI Advance

Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.
