AGI Advance: Weekly AI & AGI Insights (Apr 22, 2025)

Turing Staff
23 Apr 2025 · 3 min read
LLM training and enhancement
GenAI

Welcome to AGI Advance, Turing’s weekly briefing on AI breakthroughs, AGI research, and industry trends.

This week, we've been reflecting on how data gets collected, structured, and evaluated for supervised and reward-based LLM training. From preference optimization to multilingual coverage and rollout design, it’s clear: fine-tuning isn’t just about more data—it’s about the right data.

What we're thinking

As model training scales, we’ve been thinking deeply about what kind of data actually drives performance—and what happens when it doesn’t. Here’s what’s been top of mind:

  • Choosing the right SFT data matters more than ever: With models growing in generalization power, the marginal value of each training pair grows. That means better prompt curation, diversity in response lengths, and robust quality checks are critical.
  • Synthetic data pipelines are growing up: Techniques like masking, contrastive pairing, and structured perturbation are increasingly used to build scalable, label-efficient datasets for everything from summarization to math reasoning.
  • Not all preference data is equal: Human labeling instructions, sampling strategy, and distribution coverage directly shape model outcomes. Getting preference optimization right starts with clearer reward signals—and sharper eval loops.
  • Improving coverage for multilingual and context-aware models: Cross-lingual representation gaps, noisy doc-context alignment, and domain variance are still major bottlenecks in scaling RAG and non-English performance.
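The synthetic-data techniques mentioned above lend themselves to a quick illustration. Below is a minimal, hypothetical Python sketch of masking, structured perturbation, and contrastive pairing over plain strings; real pipelines operate on tokenized model outputs, and every function name here is illustrative rather than an actual library API.

```python
import random

random.seed(0)  # reproducible demo

def mask_tokens(text: str, rate: float = 0.3) -> str:
    """Masking: hide a random subset of tokens to create a fill-in task."""
    return " ".join(
        "[MASK]" if random.random() < rate else tok for tok in text.split()
    )

def perturb(text: str) -> str:
    """Structured perturbation: swap one adjacent token pair to mint a
    slightly degraded variant of the reference."""
    tokens = text.split()
    if len(tokens) >= 2:
        i = random.randrange(len(tokens) - 1)
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return " ".join(tokens)

def contrastive_pair(reference: str) -> dict:
    """Contrastive pairing: clean reference as 'chosen', perturbed copy as
    'rejected' -- preference-style data with no human labels."""
    return {"chosen": reference, "rejected": perturb(reference)}

reference = "The quarterly report shows revenue grew eight percent"
pair = contrastive_pair(reference)
masked = mask_tokens(reference)
```

Because the negative example is derived mechanically from the positive one, no human annotation is needed per pair, which is what makes these pipelines label-efficient at scale.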

What we're reading

  • Large Language Models Pass the Turing Test
    In a controlled, pre-registered three-party Turing test, GPT-4.5 (with persona prompting) was judged to be human 73% of the time—more often than the real human it was paired with. LLaMa-3.1 also performed near human level, while GPT-4o and ELIZA scored far below. This marks the first empirical evidence that modern LLMs can reliably pass the original Turing test, raising new questions about how we evaluate intelligence, deception, and human-likeness in AI systems.
  • Google’s LLMs Shut Down 39 Million Fraudulent Advertisers
    Google’s latest Ads Safety Report reveals its LLM-powered enforcement systems suspended 39.2 million advertiser accounts in 2024—most before they ever ran an ad. These models helped detect fraud at setup time using patterns like fake payment methods and impersonation attempts. Google also blocked or restricted over 14 billion ads and took action on 1.3 billion publisher pages, signaling how AI is reshaping large-scale trust and safety ops across the ad ecosystem.
  • A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
    This paper shows that RAFT, a rejection sampling method using only correct responses, rivals or outperforms complex RL methods on math benchmarks. The authors propose Reinforce-Rej, a minimalist variant that filters both bad and perfect prompts—delivering strong results with better training stability. The work challenges the assumed value of negative samples in LLM post-training.
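The RAFT recipe summarized above can be sketched in a few lines: draw several responses per prompt, keep only those a verifier accepts, and fine-tune on the survivors. This is a toy illustration under our own assumptions, not the paper's implementation; the sampler and verifier below are stand-ins (a real setup would sample from the model and check math answers).

```python
from typing import Callable

def raft_select(
    prompts: list[str],
    sample_fn: Callable[[str], str],         # draws one candidate response
    is_correct: Callable[[str, str], bool],  # verifier (e.g. answer checker)
    k: int = 4,
) -> list[tuple[str, str]]:
    """Rejection sampling in the RAFT style: keep only (prompt, response)
    pairs the verifier accepts; discard everything else."""
    kept = []
    for prompt in prompts:
        for _ in range(k):
            response = sample_fn(prompt)
            if is_correct(prompt, response):
                kept.append((prompt, response))
    return kept

# Toy stand-in for a model: cycles through canned candidate answers.
candidates = {"2+2": ["5", "4", "4", "3"], "3+5": ["8", "7", "8", "9"]}
iters = {p: iter(c) for p, c in candidates.items()}
sft_data = raft_select(
    prompts=list(candidates),
    sample_fn=lambda p: next(iters[p]),
    is_correct=lambda p, r: r == str(eval(p)),  # exact-match arithmetic check
    k=4,
)
# sft_data now holds only verified pairs, ready for supervised fine-tuning
```

The appeal is its simplicity: because incorrect responses are simply dropped rather than penalized, training reduces to ordinary SFT on filtered data, which is where the paper's stability argument comes from.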

Where we’ll be

Turing will be at two major AI conferences in the coming weeks—join us to discuss the future of AGI:

  • ICLR 2025 [Singapore | Apr 24 – 28]
    A top-tier deep learning conference covering representation learning, AI optimization, and theoretical advancements.
  • MLSys 2025 [Santa Clara, CA | May 12 – 15]
    A major event focused on the intersection of machine learning and systems, discussing efficient AI model training, distributed learning, and AI hardware innovations.

If you’re attending, reach out—we’d love to connect and exchange insights!

Stay ahead with AGI Advance

Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.

[Subscribe & Read More]

Want to accelerate your business with AI?

Talk to one of our solutions architects and start innovating with AI-powered talent.

Get Started