Building enterprise solutions with Generative AI, particularly Large Language Models (LLMs) like GPT-4, Claude 2, and others, is akin to navigating the Bermuda Triangle, marked by three critical constraints: cost, latency, and relevance. Crossing any one of the three can sink a project, wasting time and money.
Everyone is excited about the technology’s potential, but there are significant challenges yet to be overcome before it transforms businesses. There are strategies, however, that innovative companies are using to successfully deploy LLMs at scale – but they involve trade-offs among the three sides of the triangle. Lowering cost usually decreases relevance; increasing relevance usually increases cost and latency; and lowering latency is usually more costly. Finding the right mix is an optimization problem.
First, there’s the cost of developing, training, and maintaining a foundation model. Few companies can afford to build their own model from scratch, so they typically rely on models accessed through APIs, either closed source (e.g., OpenAI’s GPT-4, Google’s Gemini Pro, Anthropic’s Claude 2) or open source (e.g., Meta’s Llama 2).
Closed-source models are typically the most effective and the easiest to use, but accessing them comes with potentially significant costs, especially for the most performant ones. Each token generated by an LLM incurs a cost, and given LLMs’ tendency toward verbose output, a significant portion of that spend is wasted on redundant or irrelevant computation. Open-source models are typically less costly to use but require more engineering capability to deploy and maintain. Using smaller models (e.g., GPT-3.5 instead of GPT-4) can also decrease usage costs but can negatively impact relevance.
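A rough back-of-the-envelope calculation shows how quickly per-token pricing compounds at scale. The sketch below uses entirely hypothetical prices and volumes, not any vendor’s actual rates:

```python
# Back-of-envelope cost comparison between a large and a smaller model.
# All prices and volumes are hypothetical, for illustration only.

requests_per_day = 100_000
tokens_per_request = 1_500          # prompt plus a verbose completion

# Assumed prices per 1,000 tokens (not real vendor pricing)
price_large_model = 0.03
price_small_model = 0.002

def monthly_cost(price_per_1k_tokens: float) -> float:
    daily_tokens = requests_per_day * tokens_per_request
    return daily_tokens / 1_000 * price_per_1k_tokens * 30

print(f"Large model: ${monthly_cost(price_large_model):,.0f}/month")
print(f"Small model: ${monthly_cost(price_small_model):,.0f}/month")
```

Under these assumptions the larger model costs roughly fifteen times more per month for the same traffic, which is why routing work to smaller models is so tempting despite the relevance trade-off.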
The second dimension is latency. Because LLM providers themselves have limited compute, they restrict the number of tokens that can be processed per minute – so-called rate limits. That makes real-time processing nearly impossible for large-scale applications that need to process millions of tokens per minute. Latency above a few seconds can significantly hinder adoption of any AI-based application. There are ways to improve latency – such as leveraging private clouds to avoid “sharing” LLMs with other companies, which increases cost, or using smaller models, which, as noted, can negatively impact relevance.
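The arithmetic behind rate limits is unforgiving. A minimal sketch, using an assumed tokens-per-minute quota rather than any provider’s real limit, shows why a large workload cannot simply be parallelized away:

```python
# Minimum wall-clock time imposed by a provider's rate limit.
# Both numbers below are assumptions chosen for illustration.

rate_limit_tpm = 300_000          # assumed tokens-per-minute quota
workload_tokens = 50_000_000      # e.g., a nightly batch of documents

minutes_required = workload_tokens / rate_limit_tpm
print(f"At least {minutes_required:.0f} minutes "
      f"({minutes_required / 60:.1f} hours), no matter how many "
      "parallel requests are sent.")
```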
The last, and most critical, dimension is of course relevance: the ability of Generative AI systems to generate accurate, contextually appropriate, and useful output, which is essential for user adoption and therefore business impact. Despite their impressive capabilities, LLMs often produce outputs that require significant post-processing to meet specific criteria. Relevance is also the hardest dimension to measure, as it is often based on qualitative assessments.
Relevance can be improved by injecting more information into a model. This can be done through prompt engineering (giving the model the right context and instructions), Retrieval-Augmented Generation techniques (allowing models to access external information in a trusted knowledge library), or fine-tuning (training the large language model with additional datasets). Each of these methods has its pros and cons, but they all come at the expense of latency. Some, like fine-tuning, can be particularly costly and can lock enterprises into a particular model – if they switch to another model, all the fine-tuning work is lost.
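As a minimal sketch of what retrieval augmentation looks like in practice – with `search` and `complete` standing in for whatever vector store and LLM client an enterprise actually uses, not any specific product’s API – the model is simply handed the most relevant passages from a trusted knowledge library alongside the question:

```python
# Minimal Retrieval-Augmented Generation sketch (illustrative only).
# `search` and `complete` are injected placeholders for a real vector
# store lookup and a real LLM completion call.

def answer_with_rag(question: str, search, complete, top_k: int = 3) -> str:
    # 1. Retrieve the passages most relevant to the question
    #    from the trusted knowledge library.
    passages = search(question, top_k=top_k)

    # 2. Inject the retrieved context into the prompt (prompt engineering).
    context = "\n\n".join(passages)
    prompt = (
        "Answer using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Call the model. The longer, context-rich prompt improves relevance
    #    but adds tokens (cost) and retrieval time (latency).
    return complete(prompt)
```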
So how do companies navigate this Bermuda Triangle of Generative AI, balancing the computational power of the models with the speed, precision and applicability of their outputs?
Techniques like parallelizing requests across multiple older models, chunking up data, model distillation, and using less resource-intensive models can help. For example, instead of calling a single model during inference, companies may consider calling multiple models simultaneously, routing subtasks to less expensive models and saving the most expensive models for the most difficult or critical tasks. AI agents can also hand some subtasks to other systems or tools through APIs, since simpler tools or techniques can handle them.
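A minimal sketch of that routing pattern might look like the following; the model names and the `call_model` helper are placeholders, not specific products or a specific provider’s client library:

```python
import asyncio

# Illustrative only: `call_model` stands in for any async LLM client call,
# and the model names are placeholders, not specific products.

CHEAP_MODEL = "small-fast-model"
EXPENSIVE_MODEL = "large-capable-model"

async def call_model(model: str, prompt: str) -> str:
    ...  # wrap the actual provider API here
    return f"[{model}] {prompt}"

def pick_model(task: dict) -> str:
    # Route easy, low-stakes subtasks to the cheaper model.
    return EXPENSIVE_MODEL if task["difficulty"] == "hard" else CHEAP_MODEL

async def run_subtasks(subtasks: list[dict]) -> list[str]:
    # Fire all subtasks at once instead of sequentially, so total latency
    # is close to the slowest single call rather than the sum of all calls.
    calls = [call_model(pick_model(t), t["prompt"]) for t in subtasks]
    return await asyncio.gather(*calls)

# Example: three subtasks, only one of which needs the expensive model.
subtasks = [
    {"prompt": "Classify the user's intent", "difficulty": "easy"},
    {"prompt": "Summarize the order history", "difficulty": "easy"},
    {"prompt": "Draft a policy-compliant refund answer", "difficulty": "hard"},
]
results = asyncio.run(run_subtasks(subtasks))
```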
One solution BCG designed for a global consumer-facing company building a virtual assistant used such a hyper-parallelized architecture to optimize latency and cost efficiency. The architecture allows the system to make multiple LLM calls in parallel, reducing the response time to just seconds per answer.
User input is first classified to determine whether the LLM should provide an automatic answer or if it should use a category-specific business logic. Depending on the classification, the system retrieves relevant data from proprietary knowledge bases, such as product databases, customer care logs, conversational data from past interactions, and external services accessed through APIs, with different LLMs pulling specific data simultaneously to minimize latency. An LLM then uses the data to construct a response. To optimize cost, the company leverages a model-agnostic architecture, and switches between models depending on the task to be performed, using the cheapest model that performs the task at hand accurately.
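The “cheapest model that is accurate enough” logic can be as simple as a lookup over measured task performance. The sketch below is illustrative only – the catalog, costs, and accuracy figures are hypothetical, not the system described above:

```python
# Illustrative model-agnostic selection: pick the cheapest model that meets
# an accuracy bar for a given task. Names and numbers are hypothetical.

MODEL_CATALOG = [
    # (name, relative cost per call, accuracy measured on internal evals)
    ("small-model",  1.0, {"classification": 0.96, "answer_generation": 0.78}),
    ("medium-model", 4.0, {"classification": 0.97, "answer_generation": 0.90}),
    ("large-model", 15.0, {"classification": 0.98, "answer_generation": 0.95}),
]

def cheapest_adequate_model(task: str, min_accuracy: float) -> str:
    candidates = [
        (cost, name)
        for name, cost, scores in MODEL_CATALOG
        if scores.get(task, 0.0) >= min_accuracy
    ]
    if not candidates:
        raise ValueError(f"No model meets {min_accuracy:.0%} accuracy for {task}")
    return min(candidates)[1]

print(cheapest_adequate_model("classification", 0.95))     # -> small-model
print(cheapest_adequate_model("answer_generation", 0.90))  # -> medium-model
```

Because the architecture is model-agnostic, swapping the entries in such a catalog is all it takes to adopt a newer or cheaper model as the market shifts.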
Chunking is the process of breaking down extensive text data into smaller, manageable segments for efficient processing, ensuring semantic relevance and minimizing noise. Distillation is the process of training smaller models using larger LLMs, creating accessible, specialized models that require less training data while maintaining performance.
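In its simplest form, chunking is just a sliding window over the text. The sketch below uses illustrative word-count parameters; production systems typically split on semantic boundaries such as paragraphs or sections:

```python
# Minimal chunking sketch: split a long document into overlapping
# word-window segments so each stays within a model's context budget.
# Chunk size and overlap are illustrative parameters.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# A long document becomes a handful of overlapping chunks that can be
# embedded, retrieved, or summarized independently.
```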
And remember, foundation models are not the Swiss Army knives of AI. Use cases that do not directly contribute to enhancing customer service, creating new revenue streams, or addressing specific business needs may not need generative AI at all.
By considering alternative strategies, enterprises can effectively harness the potential of generative AI. But no single solution exists – it requires optimizing architecture and workflows to balance cost and capability. Orchestrating LLMs, human oversight, and various AI tools into an efficient symphony is key. And the solutions will remain iterative as the technology shifts. The technology is changing fast, but confronting the trade-offs is essential to avoid disappearing into the Bermuda Triangle of generative AI.