Evaluating VLMs On Real Business And STEM Tasks

Turing Staff
23 Apr 2025 · 3 min read

As vision-language models (VLMs) continue to improve across traditional benchmarks, a more pressing question is emerging: can they handle real-world workflows?

For most enterprise and research environments, the answer is still no.

To close this gap, a new generation of evaluation frameworks is needed—ones that prioritize reasoning, task complexity, and domain alignment over recognition or recall.

Turing’s Research team is tackling this problem with a new benchmark framework designed to evaluate how VLMs perform in the environments where professionals actually work.

Why traditional VLM benchmarks miss the mark

Many leading benchmarks today evaluate general perception tasks—object detection, basic visual question answering, or captioning. These are valuable for measuring baseline capabilities, but they fall short in two important ways:

  • They don’t reflect decision-making contexts: Real workflows involve interpreting financial reports, scientific diagrams, or operational dashboards—not labeling everyday photos.
  • They avoid ambiguity: Most tasks are multiple choice, which favors surface-level understanding and eliminates the need for deep reasoning.

When vision models are deployed in the real world, ambiguity is everywhere. Charts are messy. Questions are open-ended. Tasks require synthesis across text and image—not just matching one to the other.

A benchmark framework built for real-world reasoning

Turing’s benchmark is designed for business and STEM domains, where accuracy, context, and reasoning precision directly affect outcomes.

“We wanted to stress-test where VLMs fail when tasks look more like engineering problems or scientific questions—not quizzes.”
Mahesh Joshi, Head of Research at Turing

Each task involves:

  • Multimodal prompts: Text-plus-image inputs with domain-specific structure (e.g., diagrams, charts, data tables)
  • Open-ended questions: No multiple choice—only free-form generation
  • Domain relevance: Tasks that mirror how professionals interact with information at work

Example task:

“I’m a structural engineer designing a steel support beam. What value of h results in the maximum shear flow through the welded surfaces?”

This isn’t visual trivia. It’s an expert-level prompt requiring the model to understand a diagram, apply structural mechanics, and return a usable answer—not a guess.
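
To make the task format concrete, here is a minimal sketch of how a benchmark item like this might be represented. The `BenchmarkTask` structure and its field names are illustrative assumptions for this article, not Turing's actual schema.

```python
# Illustrative sketch only: the structure and field names are assumptions,
# not Turing's actual benchmark schema.
from dataclasses import dataclass, field

@dataclass
class BenchmarkTask:
    task_id: str                      # unique identifier for the task
    domain: str                       # e.g., "structural engineering", "finance"
    prompt: str                       # open-ended, free-form question (no answer choices)
    image_paths: list[str] = field(default_factory=list)  # diagrams, charts, data tables
    reference_notes: str = ""         # expert guidance the judge may weigh (an assumption)

beam_task = BenchmarkTask(
    task_id="stem-beam-001",
    domain="structural engineering",
    prompt=(
        "I'm a structural engineer designing a steel support beam. What value of h "
        "results in the maximum shear flow through the welded surfaces?"
    ),
    image_paths=["beam_cross_section.png"],
)
```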

Evaluation using LLM-as-a-judge

To evaluate generated responses, the team uses an LLM-based judging system:

  • Each model produces five outputs per prompt
  • An LLM scores each output for accuracy and completeness (0.0–1.0)
  • Final accuracy = average score across runs

This allows the benchmark to detect:

  • Inconsistency across multiple runs
  • Partial correctness
  • The presence (or absence) of reasoning steps

Multiple-choice formats often hide these issues. LLM-as-a-judge reveals them and captures more realistic performance signals.
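
As a rough sketch, the scoring loop described above could be wired up as follows. The `generate_candidate` and `judge_score` callables are hypothetical stand-ins for the model under test and the judging LLM; they are not real library calls.

```python
# Sketch of the LLM-as-a-judge scoring loop described above.
# `generate_candidate` and `judge_score` are hypothetical stand-ins, not real APIs.
from statistics import mean
from typing import Callable

def score_task(
    prompt: str,
    image_path: str,
    generate_candidate: Callable[[str, str], str],   # (prompt, image) -> free-form answer
    judge_score: Callable[[str, str], float],        # (prompt, answer) -> score in [0.0, 1.0]
    n_runs: int = 5,
) -> tuple[float, list[float]]:
    """Generate n_runs answers and average the judge's 0.0-1.0 scores."""
    per_run = [
        judge_score(prompt, generate_candidate(prompt, image_path))
        for _ in range(n_runs)
    ]
    return mean(per_run), per_run    # final accuracy = average across runs
```

Returning the per-run scores alongside the mean is what makes run-to-run inconsistency and partial correctness visible, which a single hit-or-miss multiple-choice score would hide.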

Key insights from early model evaluations

Across the benchmark, four leading models were tested. Key takeaways:

  • All models underperform on tasks requiring iterative or counterfactual reasoning
  • Temporal reasoning and analogical inference are frequent failure modes
  • Tasks with longer question prompts show higher variance in model responses
  • Multistep tasks compound performance degradation, especially when chart-reading and inference are required in tandem

These aren’t fringe edge cases. They reflect the everyday demands placed on VLMs in enterprise and research workflows.

What’s next: A benchmark designed for enterprise alignment

This benchmark is not static. It’s versioned, private, and built for continuous refinement. Future plans include:

  • Expanding into medical, legal, and lab-based document reasoning
  • Allowing collaborators to submit failure cases or prompt formats to expand coverage

For researchers and enterprise teams, this provides a framework to test real capability—not just leaderboard potential.

Read: Inside the Turing Applied AGI Benchmark for VLM 1.0 →
