As the field of natural language processing (NLP) advances, significant strides have been made in the development of AI models. Tech giants like OpenAI, Google, and Microsoft have created advanced large language models (LLMs). Trained on large datasets, these models can generate meaningful text from user inputs, for example, summarizing long documents or answering customers’ questions.
Popular LLMs include the GPT series (GPT-1 through GPT-3), Chinchilla, BERT, PaLM, and Gopher, each with its own strengths and weaknesses. In this article, we will discuss Chinchilla AI, developed by Google’s DeepMind.
Before the development of Chinchilla AI, big tech companies were building ever-larger models that were inefficient and demanded enormous computational power. In 2020, Kaplan and colleagues at OpenAI found that, for a fixed compute budget, most of the budget should be allocated to increasing the number of parameters, following a power-law relationship. GPT-3, which is far larger than GPT-2 and performs better, is a case in point.
DeepMind’s Chinchilla model uses the same compute budget as Gopher but only 70 billion parameters (a quarter of Gopher’s 280 billion), trained on four times more data. Its performance is better than that of GPT-3 and Gopher. Let’s explore it in more detail.
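To see what "compute-optimal" means in practice, here is a minimal sketch of the rule of thumb the Chinchilla result is often summarized by: training compute scales as roughly 6 × parameters × tokens, and parameters and data should be scaled together, at roughly 20 tokens per parameter. The constants below are common approximations, not the paper's exact fitted values.

```python
# A minimal sketch of the Chinchilla compute-optimal rule of thumb.
# Assumptions (approximations, not the paper's exact fits):
# training FLOPs C ~ 6 * N * D, and the compute-optimal data-to-model
# ratio is roughly D ~ 20 * N tokens per parameter.

def compute_optimal_allocation(flops_budget: float) -> tuple[float, float]:
    """Return (parameters, tokens) that roughly exhaust a FLOPs budget.

    Solving C = 6 * N * D with D = 20 * N gives C = 120 * N**2,
    so N = sqrt(C / 120) and D = 20 * N.
    """
    n_params = (flops_budget / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # Roughly Gopher's training budget: 280B params * 300B tokens * 6 FLOPs.
    gopher_budget = 6 * 280e9 * 300e9  # ~5.0e23 FLOPs
    n, d = compute_optimal_allocation(gopher_budget)
    print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
    # Prints roughly 6.5e10 params and 1.3e12 tokens: close to
    # Chinchilla's 70B parameters and 1.4T training tokens.
```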
The architecture of the Chinchilla language model is the same as Gopher’s, with a few exceptions, the most notable being the tokenizer.
Chinchilla uses SentencePiece, which builds subword units, such as byte-pair encoding (BPE) or a unigram language model, and can be trained directly from raw sentences. With SentencePiece, it becomes possible to create an end-to-end system that eliminates the need for language-specific preprocessing or postprocessing steps.
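As a rough illustration, the following sketch trains a SentencePiece model directly on raw text. The corpus file, vocabulary size, and model type here are illustrative choices, not Chinchilla’s actual configuration.

```python
# A minimal sketch of SentencePiece training on raw text; the corpus file,
# vocabulary size, and BPE model type are illustrative assumptions, not the
# settings DeepMind used for Chinchilla.
import sentencepiece as spm

# Train directly from raw sentences: no language-specific pre-tokenization.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # one raw sentence per line (hypothetical file)
    model_prefix="tokenizer",  # writes tokenizer.model and tokenizer.vocab
    vocab_size=32000,
    model_type="bpe",          # SentencePiece also supports "unigram"
)

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode("Chinchilla is compute-optimal.", out_type=str))  # subword pieces
print(sp.encode("Chinchilla is compute-optimal.", out_type=int))  # token ids
```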
Let’s look at the areas where Chinchilla AI has performed better than existing models.
Chinchilla outperforms Gopher on all evaluation subsets of The Pile; on the Wikitext103 benchmark, it reaches a perplexity of 7.16 compared to Gopher’s 7.75. However, it is important to exercise caution when comparing the two on language modeling benchmarks: Chinchilla is trained on four times more data than Gopher, which raises the possibility of train/test set leakage artificially inflating the results.
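For readers unfamiliar with the metric, perplexity is the exponential of the average per-token negative log-likelihood, so lower is better. A minimal sketch, with made-up token probabilities:

```python
# A minimal sketch of how perplexity is computed: the exponential of the
# average per-token negative log-likelihood. The probabilities below are
# made up for illustration.
import math

def perplexity(token_probs: list[float]) -> float:
    """Perplexity = exp(-(1/T) * sum(log p_t)) over T tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Probabilities a model assigned to each observed token in a held-out text.
probs = [0.21, 0.05, 0.40, 0.12, 0.33]
print(f"perplexity = {perplexity(probs):.2f}")
# A perplexity of 7.16 means the model is, on average, about as uncertain
# as a uniform choice over ~7 tokens at each step.
```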
The Massive Multitask Language Understanding (MMLU) benchmark comprises exam-style questions covering a wide range of academic subjects. Notably, despite being much smaller, Chinchilla significantly outperforms Gopher, achieving an average accuracy of 67.6%, 7.6 percentage points above Gopher.
Interestingly, Chinchilla even surpasses the expert forecast for June 2023, which projected an accuracy of 63.4%. It achieves accuracy rates exceeding 90% on four individual tasks: high_school_gov_and_politics, international_law, sociology, and us_foreign_policy.
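For context, multiple-choice benchmarks like MMLU are typically scored by having the model assign a likelihood to each candidate answer and picking the highest-scoring one. The sketch below assumes a hypothetical option_logprob function standing in for a real model call.

```python
# A hedged sketch of how exam-style multiple-choice benchmarks such as MMLU
# are commonly scored for LLMs: each candidate answer is scored by the
# model's likelihood, and the highest-scoring option is the prediction.
# `option_logprob` is a hypothetical stand-in for a real model call.
from typing import Callable

def predict_choice(
    question: str,
    options: list[str],
    option_logprob: Callable[[str, str], float],
) -> int:
    """Return the index of the option the model scores highest."""
    scores = [option_logprob(question, opt) for opt in options]
    return max(range(len(options)), key=lambda i: scores[i])

if __name__ == "__main__":
    # Toy stand-in scorer: rank options by word overlap with the question.
    def toy_logprob(question: str, option: str) -> float:
        return len(set(question.lower().split()) & set(option.lower().split()))

    q = "Which planet is known as the red planet?"
    opts = ["Venus", "The red planet Mars", "Jupiter", "Mercury"]
    print(opts[predict_choice(q, opts, toy_logprob)])  # -> "The red planet Mars"
```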
When evaluated on the LAMBADA dataset for final-word prediction, Chinchilla achieves an accuracy of 77.4%, surpassing both Gopher’s 74.5% and MT-NLG 530B’s 76.6%. Additionally, Chinchilla significantly outperforms Gopher on the RACE-h and RACE-m reading comprehension datasets.
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark specifically designed to assess the capabilities of large language models and extrapolate their potential future performance.
In an analysis conducted on the same set of BIG-bench tasks, Chinchilla demonstrates superior performance compared to Gopher on the majority of tasks, similar to observations in the MMLU benchmark.
On average, Chinchilla outperforms Gopher by 10.7 percentage points, achieving an accuracy of 65.1% versus Gopher’s 54.4%. Of the 62 tasks considered, Chinchilla performs worse than Gopher on only four: crash_blossom, dark_humor_detection, mathematical_induction, and logical_args.
On the Natural Questions dataset, Chinchilla achieves new state-of-the-art (SOTA) accuracies for closed-book settings, with 31.5% accuracy for the 5-shot scenario and 35.5% accuracy for the 64-shot scenario. In comparison, Gopher achieves accuracies of 21% and 28% respectively for the same scenarios.
On the TriviaQA dataset, results are reported for both the filtered set, previously used in retrieval and open-book approaches, and the unfiltered set, used in evaluations of LLMs. In both cases, Chinchilla outperforms Gopher by a substantial margin, demonstrating its strength in closed-book question answering.
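The "5-shot" and "64-shot" settings above refer to how many solved question-answer pairs are placed in the prompt before the test question. A minimal sketch of building such a prompt, with hypothetical demonstration pairs:

```python
# A minimal sketch of k-shot prompting for closed-book QA: k solved
# question-answer pairs are prepended to the test question, and the model
# must answer from its weights alone (no retrieval). The demonstration
# pairs below are hypothetical.

def build_k_shot_prompt(demos: list[tuple[str, str]], question: str, k: int) -> str:
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demos[:k])
    return f"{shots}\n\nQ: {question}\nA:"

demos = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote Hamlet?", "William Shakespeare"),
    ("What gas do plants absorb from the air?", "Carbon dioxide"),
    ("How many continents are there?", "Seven"),
    ("What is the largest planet in our solar system?", "Jupiter"),
]
print(build_k_shot_prompt(demos, "Who painted the Mona Lisa?", k=5))
```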
It is believed that large language models, including Chinchilla, reflect the contemporary and historical discourse found in their training datasets about various groups, including gender groups. The Winogender test assesses a model’s ability to correctly determine whether a pronoun refers to the occupation word or the other participant in a sentence. An unbiased model would accurately predict the referent regardless of the gender associated with the pronoun.
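To make the setup concrete, the sketch below shows Winogender-style items and a scoring rule: a model resolves an item correctly if it assigns the highest score to the true referent, whichever pronoun appears. The sentences are illustrative, written in the style of Winogender, and referent_logprob is a hypothetical stand-in for a real model call.

```python
# A sketch of Winogender-style scoring. Each item asks whether a pronoun
# refers to the occupation or the other participant; an unbiased model
# resolves it correctly regardless of the pronoun's gender. The sentences
# are illustrative, and `referent_logprob` is a hypothetical stand-in for
# a real model call.
from typing import Callable

items = [
    # (sentence, pronoun, candidate referents, correct referent)
    # The payer is the customer, whichever pronoun is used.
    ("The technician told the customer that she could pay with cash.",
     "she", ["technician", "customer"], "customer"),
    ("The technician told the customer that he could pay with cash.",
     "he", ["technician", "customer"], "customer"),
]

def resolves_correctly(sentence: str, pronoun: str, candidates: list[str],
                       gold: str,
                       referent_logprob: Callable[[str, str, str], float]) -> bool:
    """True if the model scores the gold referent highest for the pronoun."""
    best = max(candidates, key=lambda c: referent_logprob(sentence, pronoun, c))
    return best == gold
```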
On this test, Chinchilla resolves pronouns correctly more often than Gopher across all groups. The improvement is smaller for male pronouns, at 3.2%, compared to increases of 8.3% and 9.2% for female and neutral pronouns, respectively.
Additionally, on "gotcha" examples, where the correct pronoun resolution contradicts gender stereotypes based on labor statistics, Chinchilla consistently resolves pronouns more accurately than Gopher.
Breaking the examples down by gender and "gotcha" status, the largest improvement is observed for female "gotcha" examples, at 10%. This suggests that while Chinchilla overcomes gender stereotypes on more coreference examples than Gopher, the rate of improvement varies across pronouns.
These findings highlight that the advantages conferred by using a more compute-optimal model may lead to uneven improvements in resolving gender-related pronouns.
In this article, we discussed how advancements in natural language processing (NLP) have led to sophisticated LLMs like Chinchilla AI, a substantially smaller model that delivers better accuracy and performance.
Chinchilla has 70 billion parameters and was trained on four times more data than Gopher, which explains its outstanding performance. It outperformed other models on language modeling, the MMLU benchmark, reading comprehension, BIG-bench, and closed-book question answering, and showed improved pronoun resolution on gender bias tests. However, while impressive, the model is currently not open to the public.