Improving LLM Coding Accuracy in 2 Weeks through Multifaceted Evaluation

Developed a detailed understanding of a proprietary AI model's strengths and weaknesses through a comprehensive evaluation approach. Identified key areas for improvement, enhanced model performance, and laid the groundwork for targeted refinements.

  • 6 unique testing methods to develop a detailed understanding of model performance across multiple functions
  • Actionable insights for a granular understanding of model strengths, weaknesses, and refinements
  • Improved data accuracy and reliability for continuous model enhancements

Industry: Technology
Company type: Enterprise
Country: United States
Services used: LLM Training

About the client

The client is a leading U.S.-based global technology company specializing in social media, AI research, and virtual/augmented reality.

The problem

The client sought expert evaluation to enhance their custom-built large language model's accuracy, efficiency, and reliability. The model demonstrated inconsistent performance, excelling at tasks like sentiment analysis and domain-terminology processing but struggling with complex coding tasks and prompt accuracy. The lack of a detailed understanding of its strengths and weaknesses hindered further optimization and effective deployment.

The solution

Six targeted evaluation projects were run over two weeks to systematically analyze the model's strengths and weaknesses. The resulting insights enabled specific refinements and laid the foundation for ongoing model enhancement and better performance.

  • Difficulty criteria (dimensionality, number of constraints, lines of code, and problem-solving complexity)
  • Multifaceted model evaluation (illustrative sketches of the API evaluations and prompt breaking follow this list):
    1. Guided API Evaluation (text summarization, sentiment analysis, language translation)
    2. Freestyle API Evaluation (testing real-world prompts and coding use cases in a sandbox environment)
    3. Prompt Breaking (identifying code and prompts that challenge the model)
    4. LLM and Human Benchmark Analyses (analyzing failures against complex coding and industry benchmarks)
    5. Community Findings Aggregation (GitHub, Reddit sentiment analysis)
    6. RLHF & Calibration
  • Assessment levels: Four difficulty levels tested, ranging from principal engineer-level tasks down to rudimentary tasks.
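
To make the guided and freestyle API evaluations concrete, the sketch below shows one way such a harness could be structured in Python. It is illustrative only: the task prompts, scorer names, and the model_call wrapper are assumptions, not the client's actual test set or API.

    import json
    from typing import Callable

    # Illustrative guided-evaluation tasks; prompts and scorer names are placeholders,
    # not the client's production test suite.
    GUIDED_TASKS = {
        "summarization": {"prompt": "Summarize the following text in one sentence:\n{text}", "scorer": "rouge_l"},
        "sentiment": {"prompt": "Classify the sentiment as positive, negative, or neutral:\n{text}", "scorer": "exact_match"},
        "translation": {"prompt": "Translate the following sentence into French:\n{text}", "scorer": "bleu"},
    }

    def run_guided_eval(model_call: Callable[[str], str], dataset: list[dict]) -> list[dict]:
        """Send each labeled example through the model and record its raw output.

        model_call is assumed to wrap the client's LLM API (prompt in, text out);
        each dataset item carries a task name, input text, and a reference answer.
        """
        results = []
        for example in dataset:
            task = GUIDED_TASKS[example["task"]]
            prompt = task["prompt"].format(text=example["text"])
            results.append({
                "task": example["task"],
                "output": model_call(prompt),
                "reference": example["reference"],
                "scorer": task["scorer"],  # outputs are scored offline with the named metric
            })
        return results

    if __name__ == "__main__":
        stub_model = lambda prompt: "placeholder output"  # stand-in for the real API client
        sample = [{"task": "sentiment", "text": "The update broke my build.", "reference": "negative"}]
        print(json.dumps(run_guided_eval(stub_model, sample), indent=2))

Under this sketch, a freestyle run would reuse the same loop with open-ended, real-world prompts and coding tasks in place of the fixed task templates.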
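
Prompt breaking can be framed the same way: run a battery of deliberately difficult prompts and flag any output that fails a cheap programmatic check. The probes and checks below are invented examples for illustration, not the cases the team actually used.

    # Illustrative prompt-breaking probes: each pairs a tricky prompt with a simple
    # output check. Real probes would be far more extensive and domain-specific.
    PROBES = [
        {
            "prompt": "Write a Python function that returns the last three lines of a file "
                      "without reading the whole file into memory.",
            "check": lambda out: "readlines()" not in out,  # a whole-file read defeats the constraint
        },
        {
            "prompt": "Answer with exactly one word: what is the capital of Australia?",
            "check": lambda out: len(out.split()) == 1,
        },
    ]

    def find_breaking_prompts(model_call):
        """Return probes whose outputs fail their checks, i.e. prompts that 'break' the model."""
        failures = []
        for probe in PROBES:
            output = model_call(probe["prompt"])
            if not probe["check"](output):
                failures.append({"prompt": probe["prompt"], "output": output})
        return failures

    if __name__ == "__main__":
        naive_model = lambda prompt: "Canberra is the capital."  # fails the one-word check
        print(find_breaking_prompts(naive_model))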

The result

Developed a comprehensive understanding of the AI model's performance across various tasks, identifying key areas for improvement to enhance accuracy and efficiency. These results provided actionable insights for targeted refinements, ensuring continuous model enhancement and improved overall performance.

  • Comprehensive evaluation: Completed 6 projects in a 2-week sprint, evaluating dimensionality, constraints, code generation, and problem-solving complexity.
  • API enhancements: Identified strengths in sentiment analysis and legal terminology processing, and addressed weaknesses in complex language translations and medical jargon accuracy.
  • Improved prompt reliability: Identified failure cases in granular topics and complex logical puzzles, leading to more reliable model prompts.
  • Benchmarking insights: Highlighted difficulties in highly contextual tasks and revealed errors in logic and syntax through both LLM and human analyses.
  • Community feedback aggregation: Validated the correctness and optimality of information and code, and prepared a data split for the next phase: 20% targeted scenarios, 40% general weaknesses, and 40% baseline subsets (a sampling sketch follows this list).
  • Difficulty rubric: Created to iteratively assess prompt complexity, ensuring consistent evaluation from hard to easy tasks (a scoring sketch also follows this list).
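
To show how the 20/40/40 split could be assembled in practice, here is a minimal sampling sketch. Only the proportions come from the project; the pool names, field layout, and the build_next_phase_set helper are hypothetical.

    import random

    # Proportions from the case study; category names are assumptions.
    SPLIT = {"targeted_scenarios": 0.20, "general_weaknesses": 0.40, "baseline": 0.40}

    def build_next_phase_set(pools: dict[str, list], total: int, seed: int = 0) -> list:
        """Draw a fixed-size dataset honoring the 20/40/40 proportions.

        pools maps each category to its candidate examples; a category with too
        few candidates simply contributes everything it has.
        """
        rng = random.Random(seed)
        selected = []
        for name, fraction in SPLIT.items():
            quota = round(total * fraction)
            candidates = pools.get(name, [])
            selected.extend(rng.sample(candidates, min(quota, len(candidates))))
        rng.shuffle(selected)
        return selected

    if __name__ == "__main__":
        demo_pools = {
            "targeted_scenarios": [f"t{i}" for i in range(50)],
            "general_weaknesses": [f"w{i}" for i in range(200)],
            "baseline": [f"b{i}" for i in range(200)],
        }
        batch = build_next_phase_set(demo_pools, total=100)
        print(len(batch), batch[:5])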
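
As a rough illustration of how such a rubric can be applied programmatically, the sketch below maps the difficulty criteria named earlier (dimensionality, number of constraints, lines of code, multi-step problem solving) to a score and a named band. The weights, thresholds, and intermediate band labels are invented for illustration; only the criteria and the principal-to-rudimentary range come from the project.

    from dataclasses import dataclass

    @dataclass
    class PromptFeatures:
        """Features the rubric scores; field names are illustrative, not the client's schema."""
        dimensionality: int        # number of distinct sub-tasks the prompt combines
        constraint_count: int      # explicit constraints the answer must satisfy
        expected_loc: int          # rough lines of code a correct solution needs
        multi_step_reasoning: bool

    # Assumed weights; endpoint band labels follow the case study, the middle bands are made up.
    WEIGHTS = {"dimensionality": 2.0, "constraint_count": 1.5, "expected_loc": 0.05, "multi_step_reasoning": 3.0}
    BANDS = [(4, "rudimentary"), (8, "intermediate"), (12, "senior"), (float("inf"), "principal")]

    def score_prompt(features: PromptFeatures) -> tuple[float, str]:
        """Map raw prompt features to a difficulty score and a named band."""
        score = (
            WEIGHTS["dimensionality"] * features.dimensionality
            + WEIGHTS["constraint_count"] * features.constraint_count
            + WEIGHTS["expected_loc"] * features.expected_loc
            + WEIGHTS["multi_step_reasoning"] * int(features.multi_step_reasoning)
        )
        band = next(label for upper, label in BANDS if score <= upper)
        return score, band

    if __name__ == "__main__":
        hard_prompt = PromptFeatures(dimensionality=3, constraint_count=4, expected_loc=120, multi_step_reasoning=True)
        print(score_prompt(hard_prompt))  # lands in the hardest band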

Want to accelerate your business with AI?

Talk to one of our solutions architects and get a complimentary GenAI advisory session.

Get Started
