STEM datasets for post-training evaluation and reasoning

Human-authored datasets, benchmarks, and tools for evaluating and improving scientific and mathematical reasoning in LLMs.

STEM datasets

Human-authored datasets across STEM domains to support scientific accuracy, alignment training, and symbolic rigor at scale.

Math, Physics, Chemistry, and Biology Datasets

Datasets curated to test logical structure, problem-solving accuracy, and formal rigor, grounded in real-world scientific domains.
Request STEM Data Packs

Chain-of-Thought + Stepwise Reasoning Packs

Trace-based reasoning examples scored for fidelity, designed for training and reward shaping.
Request CoT Datasets
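
To make "trace-based and fidelity-scored" concrete, here is a minimal sketch of what one record in such a pack could look like. The field names and scoring layout are illustrative assumptions, not Turing's actual delivery schema.

```python
# Hypothetical shape of a single Chain-of-Thought training record.
# Field names are illustrative, not Turing's delivery schema.
cot_example = {
    "problem": "A ball is thrown upward at 12 m/s. How long until it returns to the thrower's hand?",
    "steps": [
        {"text": "Time to peak: t_up = v0 / g = 12 / 9.8 ≈ 1.22 s", "fidelity": 1.0},
        {"text": "Flight is symmetric, so total time t = 2 * t_up ≈ 2.45 s", "fidelity": 1.0},
    ],
    "final_answer": "≈ 2.4 s",
    "trace_fidelity": 1.0,  # aggregate step score, usable for filtering or reward shaping
}
```

Step-level scores like these can be consumed directly for SFT filtering or as dense signals when shaping rewards.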

High-Throughput Training Data

Structured datasets delivered at high throughput and optimized for SFT, RLHF, and symbolic alignment workflows.
Request Sample Datasets
Domain-specific dataset development

Lean-Based Proof QA Datasets

Iterative proof generation in Lean 4 paired with informal math questions, supporting symbolic reasoning and fine-tuned verification.
Request Symbolic Reasoning Datasets
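
For a sense of the pairing (an informal math question alongside a machine-checked Lean 4 proof), here is a minimal illustrative item, not drawn from the actual dataset:

```lean
-- Informal question: "Show that 0 + n = n for every natural number n."
-- One possible formal counterpart, stated and proved stepwise in Lean 4.
theorem zero_add_nat (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl
  | succ k ih => rw [Nat.add_succ, ih]
```

Each proof step the tactic block takes (base case, inductive rewrite) is the kind of intermediate state that iterative proof generation exposes for training and verification.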

Benchmarks and evaluation

Rubric-aligned benchmarks and structured diagnostics that surface STEM-specific model weaknesses and reasoning gaps.

VLM-Bench

Benchmark model reasoning on over 700 vision–language tasks grounded in STEM, logic, and world knowledge.
Download Report

GPQA, AIME, and MMLU-Pro Comparisons

See how your models stack up on known benchmarks or define your own test sets with domain-specific metrics.
Run a Diagnostic
Search-resistant problem formulation

High-Difficulty STEM Benchmarks

Evaluate your model’s capability on problems unsolvable by SOTA models, paired with rubric-based grading and expert-written answers.
Run a Diagnostic
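
As a rough illustration of rubric-based grading, the criteria, weights, and scoring rule below are invented for this example and are not Turing's actual rubric format:

```python
# Illustrative rubric only: criteria and weights are made up for this sketch.
rubric = [
    {"criterion": "Correct final answer", "weight": 0.4},
    {"criterion": "Valid derivation with no unjustified steps", "weight": 0.4},
    {"criterion": "Correct units and notation", "weight": 0.2},
]

def rubric_score(grades: dict[str, float]) -> float:
    """Combine per-criterion grades in [0, 1] into a weighted total."""
    return sum(item["weight"] * grades.get(item["criterion"], 0.0) for item in rubric)
```

Weighted, criterion-level grades make model failures legible (e.g., right answer, wrong derivation) instead of collapsing everything into a single pass/fail.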

RL environments for STEM workflows

Evaluate agents on real-world STEM tasks, generate fine-tuning trajectories, and train reward models in reproducible, high-fidelity environments.

UI-Based RL Environments for Interface Agents

Evaluate scientific reasoning agents within virtual lab environments that simulate physics, chemistry, or biological systems.
Request UI Agent Environments

MCP Environments for Function-Calling Agents

Train agents on function calling and tool execution inside sandboxed server environments. Includes tool schemas, reward verifiers, and seed databases.
Request Function-Calling Environments
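
As a rough sketch of how a tool schema, seed database, and reward verifier fit together in a sandboxed function-calling task: the tool name, fields, and verifier below are invented for illustration, not Turing's actual MCP environment API.

```python
# Illustrative sketch only: tool and field names are invented for this example.
lookup_compound_tool = {
    "name": "lookup_compound",
    "description": "Return molar mass (g/mol) for a compound in the seed database.",
    "parameters": {
        "type": "object",
        "properties": {"formula": {"type": "string"}},
        "required": ["formula"],
    },
}

SEED_DB = {"H2O": 18.015, "CO2": 44.009}  # toy seed database

def verify_trajectory(tool_calls: list[dict], expected_formula: str) -> float:
    """Reward 1.0 if the agent called the tool with the expected argument, else 0.0."""
    for call in tool_calls:
        args = call.get("arguments", {})
        if call.get("name") == "lookup_compound" and args.get("formula") == expected_formula:
            return 1.0
    return 0.0
```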

End-to-End Evaluation and Training Loops

Each RL environment includes prompts, verifiers, analytics harnesses, and trajectory outputs, enabling evaluation diagnostics, reward shaping, and supervised fine-tuning at scale.
Request RL Environments
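
As a minimal sketch of how prompts, a verifier, and trajectory logging compose into one loop: run_agent() and the record layout are hypothetical stand-ins, not a real environment API.

```python
# Hypothetical evaluation/training loop over an RL environment's prompts and verifier.
def evaluate(prompts, verifier, run_agent):
    trajectories, rewards = [], []
    for prompt in prompts:
        trajectory = run_agent(prompt)           # list of agent steps / tool calls
        reward = verifier(prompt, trajectory)    # scalar score from the environment's verifier
        trajectories.append({"prompt": prompt, "steps": trajectory, "reward": reward})
        rewards.append(reward)
    # High-reward trajectories can feed SFT or reward-model training;
    # the aggregate reward doubles as an evaluation diagnostic.
    return sum(rewards) / max(len(rewards), 1), trajectories
```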

Research and case studies

FAQs

What STEM domains does Turing provide datasets for?

Turing offers human-authored datasets across math, physics, chemistry, and biology designed to test logical structure, problem-solving accuracy, and formal rigor grounded in real-world scientific domains.

What are Chain-of-Thought datasets used for? 

Chain-of-Thought datasets are trace-based reasoning examples scored for fidelity, designed specifically for training and reward shaping to improve stepwise reasoning capabilities.

Does Turing provide benchmarks for evaluating LLM performance?

Yes, Turing offers rubric-aligned benchmarks including VLM-Bench with over 700 vision-language tasks, high-difficulty STEM benchmarks, and comparisons against known benchmarks like GPQA, AIME, and MMLU-Pro.

What are RL environments for STEM workflows?

Turing provides reproducible, high-fidelity environments where you can evaluate agents on real-world STEM tasks, generate fine-tuning trajectories, and train reward models, including UI-based environments and MCP environments for function-calling agents.

What is included in Turing's RL environments? 

Each RL environment includes prompts, verifiers, analytics harnesses, and trajectory outputs, enabling evaluation diagnostics, reward shaping, and supervised fine-tuning at scale.


Does Turing offer Lean-based proof datasets?

Yes, Turing provides Lean-based proof QA datasets featuring iterative proof generation in Lean 4 paired with informal math questions, supporting symbolic reasoning and fine-tuned verification.

What is VLM-Bench? 

VLM-Bench is Turing’s benchmark for vision-language reasoning, covering more than 700 tasks across STEM, logical inference, spatial reasoning, and real-world multimodal problem-solving.

Can Turing's datasets be used for both training and evaluation?

Yes. Turing’s data packs and datasets support post-training workflows including supervised fine-tuning, evaluator calibration, symbolic reasoning tasks, and structured evaluation across STEM domains.

Scale STEM reasoning with expert-built datasets

Train, fine-tune, or evaluate models on structured STEM tasks, backed by domain-reviewed data and traceable QA.

Talk to a Researcher

AGI Advance Newsletter

Weekly updates on frontier benchmarks, evals, fine-tuning, and agentic workflows, read by top labs and AI practitioners.

Subscribe Now