Get Curated Research Datasets

Access benchmark-quality RL, multimodal, vision, and STEM datasets to accelerate your post-training research. Choose from pre-defined packs or create custom datasets tailored to your experiments.

Request Data Packs

Dataset Catalog

Choose from our curated data collections optimized for post-training research and ready to request:

Multimodal Data

Audio, vision, and interface datasets for evaluating and training multimodal reasoning across real-world workflows.

Domain-Specific Data

Finance, economics, medical, and legal datasets designed with subject-matter expertise to support domain-grounded model performance.

STEM Data

Expert-curated datasets in chemistry, physics, biology, and mathematics built to advance scientific reasoning and computational precision.

Coding Data

Frontier datasets for reasoning, function calling, and real-world coding benchmarks, including SWE-bench and SWE-bench Verified.

Robotics & Embodied AI Data

World modeling and embodied reasoning data, spanning video game simulations, teleoperation demos, and annotated trajectories.

Custom Data

Scoped datasets for edge cases, novel modalities, and emerging research needs.

Why These Datasets

Real-World & Benchmark Relevance

Collections align with real-world workflows and both public and proprietary benchmarks, providing coverage across the standards that define model maturity.

Expert Curation & Quality

All datasets are validated by PhD-level researchers and domain experts to ensure reproducibility, reliability, and readiness for integration.

Reproducible Methods

All datasets are built with reproducible methods and traceable QA, ensuring results can be validated and integrated with confidence.

Frequently Asked Questions

How long does it take to receive sample data?

Samples are delivered by email, typically within 48 hours of your request, so you can begin integration and evaluation without delay.

Can I request multiple datasets at once?

Yes. You can select any combination of pre-defined packs and custom datasets in a single request form, and we’ll bundle them into one delivery.

What formats and modalities are supported?

We provide samples in ML-ready formats (e.g., image folders, CSV/JSON for tabular and text data, WAV for audio). All modalities listed in the catalog, including vision, audio, STEM, and coding, are available.

How do you license sample data?

Sample datasets are provided under a research-only license. For full-pack access or commercial use, please ask about terms and pricing.

Can I get a custom data pack if I don’t see what I need?

Yes. Select the "custom" option in your request and provide additional details. Our research team will work with you to assemble the right dataset.

What happens after I receive samples?

You’ll receive curated sample files and metadata, followed by outreach from our research team to discuss full-pack access, volume, pricing, and any custom adjustments.

Ready for Frontier Model Data?

Request your data packs today and accelerate your research.
