Choosing the right approach to utilize LLMs for enterprise GenAI applications


Generative artificial intelligence (GenAI) is leading a transformative wave across industries with its ability to autonomously generate content, whether text, images, videos, or even intricate 3D protein structures. The breakthrough came with large language models (LLMs) like ChatGPT, which surprised specialists with how well machines could understand and generate humanlike content.

Leading enterprises are already embracing GenAI to innovate and improve their operations. From Morgan Stanley equipping wealth managers with AI copilots to Coca-Cola's collaboration with digital artists for AI-driven branding content, the applications are as diverse as they are impactful. According to a Fortune Business Insights report, the GenAI market is projected to reach USD 967.65 billion by 2032. Companies that integrate GenAI now can position themselves to leverage this anticipated growth and reshape their operational landscapes.

In this blog, we will dive into the methodologies for leveraging LLMs and outline the potential approaches enterprises can take to harness this groundbreaking technology effectively.

Let’s get started!

Different approaches to utilize LLMs for building enterprise GenAI applications

Enterprises can utilize LLMs in three major ways to build their own GenAI applications. These approaches, listed from lowest to highest complexity, are prompt engineering, retrieval-augmented generation, and fine-tuning.

Approaches to utilize LLMs

Let's explore these methods in more detail.

Prompt engineering

Many businesses will start their LLM journey with this method, as it is both time-saving and cost-efficient. It uses prompts with APIs from third-party LLM providers such as Anthropic, OpenAI, or Cohere, or with open-source LLMs like Llama, Falcon, and Mistral. A prompt is a task-specific textual instruction that is expected to elicit a desired result. For example, the prompt below instructs the LLM to generate sentences similar to the input text.


Prompt engineering example

But because these LLMs are general purpose, they may not return the desired response unless the question is framed in a specific way, as shown in the prompt example above. Crafting these prompts, a process known as prompt engineering, therefore requires creative writing ability and numerous iterations to achieve the optimal response.
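To make this concrete, here is a minimal sketch of sending such a prompt through a provider API. It assumes the OpenAI Python client and an illustrative model name; the same idea applies to Anthropic, Cohere, or a self-hosted open-source model.

```python
# Minimal prompt-engineering sketch (assumes the OpenAI Python client;
# swap in Anthropic, Cohere, or an open-source model server as needed).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Generate three sentences similar in style and meaning to the input text.\n"
    "Input text: Our quarterly revenue grew faster than expected."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```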

These prompts can be enhanced with examples to help guide the LLM. The examples are placed before the actual prompt in what is referred to as the prompt's "context." Providing examples in this manner is known as one-shot or few-shot prompting, depending on the number of examples provided.

Let’s understand with an example:

one-shot prompting and few-shot prompting
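Below is a hedged code sketch of few-shot prompting through the same illustrative API: the worked examples simply sit in the context ahead of the new input. The task, examples, and model name are all placeholders.

```python
# Few-shot prompting: worked examples go into the context before the new input.
from openai import OpenAI

client = OpenAI()

few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The onboarding process was quick and the support team was helpful.
Sentiment: Positive

Review: The dashboard kept crashing and nobody answered my tickets.
Sentiment: Negative

Review: Setup took five minutes and the reports are exactly what we needed.
Sentiment:"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": few_shot_prompt}],
    temperature=0,
)
print(response.choices[0].message.content)  # expected: "Positive"
```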

Another common prompting approach is in-context learning (ICL), which entails framing task descriptions and/or demonstrations in natural language text. Furthermore, the use of chain-of-thought (CoT) prompting can augment in-context learning by including a series of logical reasoning steps in the prompts to guide the model more effectively. 
Let’s understand with an example:

A comparative illustration of ICL and CoT prompting

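To make the difference concrete, here is a hedged sketch of a chain-of-thought prompt sent through the same illustrative API. The worked example in the prompt spells out the reasoning steps, nudging the model to reason before answering the new question.

```python
# Chain-of-thought prompting: the in-context example shows the reasoning steps,
# not just the final answer, so the model is guided to reason step by step.
from openai import OpenAI

client = OpenAI()

cot_prompt = """Q: A warehouse ships 120 orders per day. After automation, daily volume
rises by 25%. How many orders does it ship per week (7 days)?
A: First, 25% of 120 is 30, so the new daily volume is 120 + 30 = 150 orders.
Over 7 days that is 150 * 7 = 1,050 orders. The answer is 1,050.

Q: A support team resolves 80 tickets per day. A new triage bot reduces their
load by 15%. How many tickets do they handle over a 5-day week?
A:"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": cot_prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
```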

Retrieval-augmented generation (RAG)

Foundation models are often trained on general-domain data, which can limit their ability to produce responses tailored to specific fields. Therefore, enterprises may opt to implement LLMs using their proprietary data to build applications relevant to their industry (such as a real-time language translation app or a model that generates in-depth financial reports, forecasts, and analyses). This also allows responses to be generated using confidential or up-to-date information.

In such scenarios, companies can use RAG to enhance prompts by integrating external data, which could include an entire document or specific segments of it. This data is then included as context in the prompt, along with the user's query, which enables the LLM to provide accurate responses based on that input.

Let’s now look at the functional architecture diagram that outlines the various components essential to building an enterprise RAG application.

Functional architecture diagram of enterprise RAG applications


This architecture comprises three stages:

  • Pre-retrieval

First, data is ingested from the specified data sources and decomposed into smaller text chunks to facilitate searching and to fit within the token limit of LLMs. Next, an embedding model such as OpenAI’s ada, bge-large, or all-mpnet is used to generate embeddings for these text chunks. The embeddings are then indexed and stored in a vector database like ChromaDB or Pinecone. This process is performed offline and is a one-time step for all necessary documents; it is repeated only if additional documents are added to the corpus.
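As an illustration, here is a minimal pre-retrieval sketch. It assumes the all-mpnet embedding model (via sentence-transformers) and a local ChromaDB collection; the chunking strategy, embedding model, and vector database are all interchangeable, and load_documents is a hypothetical helper.

```python
# Pre-retrieval (offline): chunk documents, embed the chunks, index them.
# sentence-transformers and ChromaDB are illustrative choices.
import chromadb
from sentence_transformers import SentenceTransformer

documents = load_documents()  # hypothetical helper returning a list of strings

def chunk(text, size=500, overlap=50):
    """Split text into overlapping character-based chunks."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

chunks = [c for doc in documents for c in chunk(doc)]

embedder = SentenceTransformer("all-mpnet-base-v2")
embeddings = embedder.encode(chunks).tolist()

client = chromadb.PersistentClient(path="./rag_index")
collection = client.get_or_create_collection(name="enterprise_docs")
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
)
```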

  • Retrieval

During user query time (or online inference), retrieval begins by passing the user query through an input guardrail to anonymize any sensitive information, restrict certain query topics or code, and detect prompt injection, among other tasks. Query understanding and rewriting then take place to capture the query's intent and attributes and put it in a form suitable for retrieval and for the LLM. Finally, an embedding is generated for the query and matched against the corpus of chunks stored in the vector database using a similarity metric to retrieve the most similar chunks.
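Continuing the same illustrative setup, a minimal retrieval sketch embeds the query with the same model and asks the vector store for the nearest chunks; the guardrail and query-rewriting steps are represented only by a placeholder comment.

```python
# Retrieval (online): embed the user query and fetch the most similar chunks.
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-mpnet-base-v2")
client = chromadb.PersistentClient(path="./rag_index")
collection = client.get_or_create_collection(name="enterprise_docs")

user_query = "What was our data-retention policy for EU customers in 2023?"
# (input guardrails and query rewriting would run here before embedding)

query_embedding = embedder.encode([user_query]).tolist()
results = collection.query(query_embeddings=query_embedding, n_results=5)
retrieved_chunks = results["documents"][0]  # top-5 chunks by similarity
```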

  • Post-retrieval

When retrieval from the vector database fails to deliver optimal quality, a reranker is used to improve the chunk ranking. A common approach is to use open-source, encoder-only transformers such as BGE-large in a cross-encoder setup; more recently, decoder-only methods like RankVicuna, RankGPT, and RankZephyr have significantly improved reranker performance. Finally, the prompt is framed, comprising the user query, the retrieved and reranked document chunks, and a system prompt, and is sent to the LLM for response generation. An output guardrail checks for hallucinations and evaluates response quality. If satisfactory, the LLM response is cached for later reference and then returned to the user.
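A hedged post-retrieval sketch, continuing from the retrieval step above, might look as follows. It assumes the open-source bge-reranker-large cross-encoder and an OpenAI-style chat endpoint for generation; the guardrail and caching steps are again left as comments.

```python
# Post-retrieval: rerank the retrieved chunks, frame the prompt, generate.
from sentence_transformers import CrossEncoder
from openai import OpenAI

# user_query and retrieved_chunks come from the retrieval step sketched above.
reranker = CrossEncoder("BAAI/bge-reranker-large")  # illustrative cross-encoder
scores = reranker.predict([(user_query, chunk) for chunk in retrieved_chunks])
best_chunk = retrieved_chunks[int(scores.argmax())]

prompt = (
    "Answer the question using only the context below.\n"
    f"Context:\n{best_chunk}\n\nQuestion: {user_query}"
)

llm = OpenAI()
response = llm.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": "You are a careful enterprise assistant."},
        {"role": "user", "content": prompt},
    ],
)
answer = response.choices[0].message.content
# (output guardrails and response caching would run here before returning)
print(answer)
```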

The RAG framework reduces the risk of producing misleading information by integrating data directly into the prompt rather than relying solely on the LLM's internal knowledge. Nevertheless, it's crucial for businesses to recognize that this approach isn't entirely immune to inaccuracies: the quality of the supplied data and of the retrieval techniques directly affects the precision of the LLM's outputs. Additionally, including proprietary or sensitive information in the LLM request raises data privacy concerns and increases the number of tokens sent with each request, resulting in higher costs and slower response times.

Fine-tuning

To address the previously mentioned limitations of RAG, the fine-tuning approach can be applied. In this process, the LLM integrates the knowledge from your fine-tuning dataset directly into its structure by updating its internal weights. After fine-tuning, there's no need to include examples or additional information within the prompt's context. This technique eliminates issues with token size limits, mitigates privacy concerns, and improves response times. The integration of the fine-tuning data's full context into the model also results in higher quality and more generalized responses.

Fine-tuning process

However, fine-tuning is most effective when there's a significant volume of instructions to learn from, typically in the thousands, and it demands considerable resources and time. Beyond the fine-tuning process, considerable effort is required to compile a dataset in the appropriate format for tuning. 
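For orientation, here is a heavily simplified sketch of one common approach: parameter-efficient LoRA fine-tuning with Hugging Face transformers and peft. The base model name, training examples, and hyperparameters are illustrative placeholders; a production run would use thousands of carefully formatted instruction-response pairs plus evaluation.

```python
# Simplified LoRA fine-tuning sketch (Hugging Face transformers + peft).
# Model name, data, and hyperparameters are illustrative placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

base_model = "mistralai/Mistral-7B-v0.1"  # any open causal LM you have access to
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(base_model)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

# In practice this would be thousands of instruction/response pairs.
train_data = Dataset.from_dict({"text": [
    "### Instruction: Summarize the Q3 risk report.\n### Response: ...",
    "### Instruction: Draft a client onboarding email.\n### Response: ...",
]})
tokenized = train_data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./finetuned", num_train_epochs=3,
                           per_device_train_batch_size=2, learning_rate=2e-4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```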

Click here to learn how Turing helped a client enhance its LLM accuracy and functionality with 115+ evaluation datasets and 5,000+ RLHF interactions.

The future: What’s next?

The transformative potential of enterprise GenAI is vast and can significantly impact various aspects of business operations such as employee productivity, customer engagement, cost reduction, risk analysis, and decision making. However, navigating the adoption of GenAI presents major hurdles, including the lack of a solid GenAI strategy, lack of technical expertise, and difficulty in identifying suitable approaches for utilizing LLMs, among others. 

To thrive in this AI-first era, businesses need an AI-powered tech services partner like Turing that understands foundational LLMs and has a deep bench of AI talent. Over 1,000 companies, including many of the world’s leading foundation model companies, have partnered with Turing to build enterprise GenAI applications, train and enhance LLMs, and hire on-demand technical professionals.

Our vast experience in GenAI is demonstrated through successful projects such as AI coding assistants, enhanced ranking systems, refined LLM responses, and website chatbots for better lead generation. Turing's access to a global talent pool of over 3 million developers across 100+ countries enables us to harness diversity for innovation, driving excellence in AI and GenAI initiatives.

Accelerate your customer success, internal productivity, market share, and more with GenAI technology. Talk to an expert today!
