STEM datasets for post-training evaluation and reasoning

Human-authored datasets, benchmarks, and tools for evaluating and improving scientific and mathematical reasoning in LLMs.

STEM datasets

Human-authored datasets across STEM domains to support scientific accuracy, alignment training, and symbolic rigor at scale.

Math, Physics, Chemistry, and Biology Datasets

Datasets curated to test logical structure, problem-solving accuracy, and formal rigor, grounded in real-world scientific domains.
Request STEM Data Packs

Chain-of-Thought + Stepwise Reasoning Packs

Trace-based reasoning examples scored for fidelity, designed for training and reward shaping.
Request CoT Datasets
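
To make "trace-based and fidelity-scored" concrete, here is a minimal sketch of what one record in such a pack could look like. The field names and scoring layout are illustrative assumptions, not Turing's actual delivery schema.

```python
# Hypothetical shape of a single Chain-of-Thought training record.
# Field names are illustrative, not Turing's delivery schema.
cot_example = {
    "problem": "A ball is thrown upward at 12 m/s. How long until it returns to the thrower's hand?",
    "steps": [
        {"text": "Time to peak: t_up = v0 / g = 12 / 9.8 ≈ 1.22 s", "fidelity": 1.0},
        {"text": "Flight is symmetric, so total time t = 2 * t_up ≈ 2.45 s", "fidelity": 1.0},
    ],
    "final_answer": "≈ 2.4 s",
    "trace_fidelity": 1.0,  # aggregate step score, usable for filtering or reward shaping
}
```

Step-level scores like these can be consumed directly for SFT filtering or as dense signals when shaping rewards.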

High-Throughput Training Data

Structured datasets delivered at high throughput and optimized for SFT, RLHF, and symbolic alignment workflows.
Request Sample Datasets
Domain-specific dataset development

Lean-Based Proof QA Datasets

Iterative proof generation in Lean 4 paired with informal math questions, supporting symbolic reasoning and fine-tuned verification.
Request Symbolic Reasoning Datasets
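
For a sense of the pairing (an informal math question alongside a machine-checked Lean 4 proof), here is a minimal illustrative item, not drawn from the actual dataset:

```lean
-- Informal question: "Show that 0 + n = n for every natural number n."
-- One possible formal counterpart, stated and proved stepwise in Lean 4.
theorem zero_add_nat (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl
  | succ k ih => rw [Nat.add_succ, ih]
```

Each proof step the tactic block takes (base case, inductive rewrite) is the kind of intermediate state that iterative proof generation exposes for training and verification.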

Benchmarks and evaluation

Rubric-aligned benchmarks and structured diagnostics that surface STEM-specific model weaknesses and reasoning gaps.

VLM-Bench

Benchmark model reasoning on over 700 vision–language tasks grounded in STEM, logic, and world knowledge.
Download Report

GPQA, AIME, and MMLU-Pro Comparisons

See how your models stack up on known benchmarks or define your own test sets with domain-specific metrics.
Run a Diagnostic
Search-resistant problem formulation

High-Difficulty STEM Benchmarks

Evaluate your model’s capability on problems unsolvable by SOTA models, paired with rubric-based grading and expert-written answers.
Run a Diagnostic
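
As a rough illustration of rubric-based grading, the criteria, weights, and scoring rule below are invented for this example and are not Turing's actual rubric format:

```python
# Illustrative rubric only: criteria and weights are made up for this sketch.
rubric = [
    {"criterion": "Correct final answer", "weight": 0.4},
    {"criterion": "Valid derivation with no unjustified steps", "weight": 0.4},
    {"criterion": "Correct units and notation", "weight": 0.2},
]

def rubric_score(grades: dict[str, float]) -> float:
    """Combine per-criterion grades in [0, 1] into a weighted total."""
    return sum(item["weight"] * grades.get(item["criterion"], 0.0) for item in rubric)
```

Weighted, criterion-level grades make model failures legible (e.g., right answer, wrong derivation) instead of collapsing everything into a single pass/fail.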

RL environments for STEM workflows

Evaluate agents on real-world STEM tasks, generate fine-tuning trajectories, and train reward models in reproducible, high-fidelity environments.

UI-Based RL Environments for Interface Agents

Evaluate scientific reasoning agents within virtual lab environments that simulate physics, chemistry, or biological systems.
Request UI Agent Environments

MCP Environments for Function-Calling Agents

Train agents on function calling and tool execution inside sandboxed server environments. Includes tool schemas, reward verifiers, and seed databases.
Request Function-Calling Environments
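
As a rough sketch of how a tool schema, seed database, and reward verifier fit together in a sandboxed function-calling task: the tool name, fields, and verifier below are invented for illustration, not Turing's actual MCP environment API.

```python
# Illustrative sketch only: tool and field names are invented for this example.
lookup_compound_tool = {
    "name": "lookup_compound",
    "description": "Return molar mass (g/mol) for a compound in the seed database.",
    "parameters": {
        "type": "object",
        "properties": {"formula": {"type": "string"}},
        "required": ["formula"],
    },
}

SEED_DB = {"H2O": 18.015, "CO2": 44.009}  # toy seed database

def verify_trajectory(tool_calls: list[dict], expected_formula: str) -> float:
    """Reward 1.0 if the agent called the tool with the expected argument, else 0.0."""
    for call in tool_calls:
        args = call.get("arguments", {})
        if call.get("name") == "lookup_compound" and args.get("formula") == expected_formula:
            return 1.0
    return 0.0
```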

End-to-End Evaluation and Training Loops

Each RL environment includes prompts, verifiers, analytics harnesses, and trajectory outputs, enabling evaluation diagnostics, reward shaping, and supervised fine-tuning at scale.
Request RL Environments
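
As a minimal sketch of how prompts, a verifier, and trajectory logging compose into one loop: run_agent() and the record layout are hypothetical stand-ins, not a real environment API.

```python
# Hypothetical evaluation/training loop over an RL environment's prompts and verifier.
def evaluate(prompts, verifier, run_agent):
    trajectories, rewards = [], []
    for prompt in prompts:
        trajectory = run_agent(prompt)           # list of agent steps / tool calls
        reward = verifier(prompt, trajectory)    # scalar score from the environment's verifier
        trajectories.append({"prompt": prompt, "steps": trajectory, "reward": reward})
        rewards.append(reward)
    # High-reward trajectories can feed SFT or reward-model training;
    # the aggregate reward doubles as an evaluation diagnostic.
    return sum(rewards) / max(len(rewards), 1), trajectories
```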

Research and case studies

FAQs

What STEM domains does Turing provide datasets for?

Turing offers human-authored datasets across math, physics, chemistry, and biology designed to test logical structure, problem-solving accuracy, and formal rigor grounded in real-world scientific domains.

What are Chain-of-Thought datasets used for? 

Chain-of-Thought datasets are trace-based reasoning examples scored for fidelity, designed specifically for training and reward shaping to improve stepwise reasoning capabilities.

Does Turing provide benchmarks for evaluating LLM performance?

Yes, Turing offers rubric-aligned benchmarks including VLM-Bench with over 700 vision-language tasks, high-difficulty STEM benchmarks, and comparisons against known benchmarks like GPQA, AIME, and MMLU-Pro.

What are RL environments for STEM workflows?

Turing provides reproducible, high-fidelity environments where you can evaluate agents on real-world STEM tasks, generate fine-tuning trajectories, and train reward models, including UI-based environments and MCP environments for function-calling agents.

What is included in Turing's RL environments? 

Each RL environment includes prompts, verifiers, analytics harnesses, and trajectory outputs, enabling evaluation diagnostics, reward shaping, and supervised fine-tuning at scale.


Does Turing offer Lean-based proof datasets?

Yes, Turing provides Lean-based proof QA datasets featuring iterative proof generation in Lean 4 paired with informal math questions, supporting symbolic reasoning and fine-tuned verification.

What is VLM-Bench? 

VLM-Bench is Turing’s benchmark for vision-language reasoning, covering more than 700 tasks across STEM, logical inference, spatial reasoning, and real-world multimodal problem-solving.

Can Turing's datasets be used for both training and evaluation?

Yes. Turing’s data packs and datasets support post-training workflows including supervised fine-tuning, evaluator calibration, symbolic reasoning tasks, and structured evaluation across STEM domains.

Scale STEM reasoning with expert-built datasets

Train, fine-tune, or evaluate models on structured STEM tasks, backed by domain-reviewed data and traceable QA.

Talk to a Researcher

AGI Advance Newsletter

Weekly updates on frontier benchmarks, evals, fine-tuning, and agentic workflows, read by top labs and AI practitioners.

Subscribe Now