By Michael N. Price
Artificial Intelligence (AI) continues to evolve at a breakneck pace, making it nearly impossible for organizations to keep up with the latest industry trends and technological advancements.
As new and updated models become available, practical experimentation and research are proving crucial for organizations of all sizes looking to understand and improve the business applications of these complex, massively disruptive technologies.
At Boomi, the industry leader in intelligent integration and automation, our innovation experts recently designed an experiment to evaluate leading AI models side by side. The goal: help AI practitioners and line-of-business users evaluate the rapidly growing menu of AI tools, models, and technologies.
To do this, Boomi leveraged its platform to measure the performance of multiple AI models simultaneously, comparing generative AI models that leverage Retrieval-Augmented Generation (RAG) technologies against “vanilla” Large Language Models (LLMs) while also exploring cutting-edge implementations released by generative AI’s industry leaders.
Chris Cappetta, Principal Solutions Architect and a leading AI expert in the Boomi Innovation Group, initiated the recent experiment to investigate and better understand the ever-increasing maze of newly emerging AI capabilities and techniques. Through this analysis, Boomi’s innovators hope to determine which tools and designs might be the most effective for future implementations of Boomi AI use cases and provide the most value to customers.
Because the available pool of AI tools and designs continues to grow, seemingly by the day, the team’s primary goal was to separate speculation from evidence-based strategies. Using Boomi’s platform, Cappetta designed a side-by-side analysis of several prominent AI designs to discern which model yielded the most consistent and highest-quality responses.
Cappetta, who filed a patent in 2019 (awarded in 2023) for an invention that extends Natural Language Processing AI capabilities with business-specific logic, orchestration, and connectivity, said the recent mass proliferation of AI tools and strategies has left companies with no shortage of options, but with a notable lack of concrete, evidence-based recommendations on which ones deliver the best operational results.
“The bulk of the work that is evidence-based is at the level of people training the actual models,” Cappetta said. “But it’s research and development that is so nuanced, it’s layers deeper than what people at large organizations can actually act on while making real-world business decisions with real-world impacts.”
Cappetta added that the rate of AI innovation, mixed with the realities of modern social media, has created a culture of speculators and creators who race to beat the rest of the pack when publishing content that helps others understand new developments. Organizations fall into this pattern, too, quickly implementing the latest tools in a way that captures attention but may not be suited for long-term use.
“A lot of organizations will put out stuff in sort of the speed mindset to get something out there that’s gonna get visibility . . . but it may be that after you run it for a month, you change your opinion about how that idea works,” Cappetta said.
Retrieval-Augmented Generation vs. Vanilla LLMs: A Comparative Study
To change that, Cappetta built a Boomi process designed to leverage multiple AI models simultaneously and designed a scoring system to evaluate each model’s individual performance. This allowed Boomi innovators to create a quantified analysis of AI performance in real-world scenarios, using real-world data.
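Boomi hasn’t published the process or its prompts, but the core idea can be pictured as a short sketch: send the same question to each candidate model, then have a judge model score every answer against the same rubric. Everything in the snippet below, the model names, the sample question, and the 1-to-10 rubric, is an illustrative assumption rather than Boomi’s actual implementation.

```python
# Hypothetical sketch of a side-by-side evaluation: every candidate model
# answers the same question, and a judge model scores each answer with the
# same rubric. Model names, question, and rubric are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = "How do I schedule a Boomi integration process to run nightly?"
CANDIDATE_MODELS = ["gpt-3.5-turbo", "gpt-4"]

def ask(model: str, question: str) -> str:
    """Send the same question to a candidate model and return its answer."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def score(question: str, answer: str) -> str:
    """Ask a judge model to rate an answer on a simple 1-to-10 rubric."""
    rubric = (
        "Rate the following answer from 1 (poor) to 10 (excellent) for "
        "accuracy, completeness, and relevance to the question. "
        "Reply with the number only.\n\n"
        f"Question: {question}\n\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": rubric}],
    )
    return resp.choices[0].message.content.strip()

for model in CANDIDATE_MODELS:
    print(f"{model}: score {score(QUESTION, ask(model, QUESTION))}")
```

The point of the structure is simply that every model sees the same input and every answer is judged by the same rubric, which is what makes the results comparable across designs.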
One of the experiment’s central hypotheses was that RAG designs would outperform standalone vanilla LLMs by providing more accurate and high-quality results. Spoiler alert: they did.
Cappetta said this suspicion was strongly affirmed as RAG models, which integrate external data retrieval into their response generation, demonstrated superior performance over their base-LLM counterparts. While not entirely surprising, this finding highlighted RAG’s potential in applications requiring up-to-date, specific information.
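To make the comparison concrete, a rough sketch of the two designs is shown below: the “vanilla” call sends the user’s question on its own, while the RAG call first retrieves relevant passages and injects them into the prompt as context. The tiny keyword-based retriever and document snippets are stand-in assumptions; a production RAG design would use a real semantic search or vector index, and the experiment’s actual retrieval pipeline isn’t described at this level of detail.

```python
# Minimal sketch contrasting a "vanilla" LLM call with a RAG-style call.
# The document chunks and keyword retriever are placeholders for a real
# semantic search backend.
from openai import OpenAI

client = OpenAI()

DOCUMENT_CHUNKS = [
    "Boomi processes can be scheduled from the process deployment settings...",
    "Connectors handle authentication to external systems...",
]

def retrieve_passages(question: str, k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval, standing in for real semantic search."""
    words = set(question.lower().split())
    ranked = sorted(
        DOCUMENT_CHUNKS,
        key=lambda chunk: len(words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def vanilla_answer(question: str) -> str:
    """'Vanilla' design: the model answers from its training data alone."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def rag_answer(question: str) -> str:
    """RAG design: retrieved passages are injected as context for the answer."""
    context = "\n\n".join(retrieve_passages(question))
    prompt = (
        "Use the reference material below as context to answer the question; "
        "do not simply summarize the material.\n\n"
        f"Reference material:\n{context}\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```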
In fact, Boomi’s experiment found the improvement in response quality from vanilla GPT-4 to the GPT-4 RAG design even outweighed the quality jump from OpenAI’s GPT-3.5 to GPT-4.
This surprised Cappetta’s team because one of GPT-3.5’s most significant limitations is its lack of access to the more up-to-date information available to the GPT-4 model. The fact that the GPT-4 RAG design produced such a noticeable leap in response quality, even compared to the base GPT-4 model, further demonstrated the value created by RAG technologies.
Another unexpected conclusion came when assessing different technologies built on top of GPT-4. Cappetta’s team compared the performance of OpenAI’s Assistants API against the Chat Completions API, testing each API’s responses when provided identical data and near-identical instructions (as identical as possible given structural differences between the two).
The experiment found that the Chat Completions API strongly outperformed the Assistants API, even when both leveraged RAG to reference external documentation to guide a response.
“It appears that the Assistants version of the GPT-4 model focused too strongly on the reference material it was provided, creating a summary of that material instead of using it as context to build and support its answer to the user’s original question,” Cappetta said. “The Chat Completion model, on the other hand, was more effective at using the referenced material to build a meaningful answer.”
Boomi’s team concluded the underperformance of the Assistants API suggests that the model’s specialized training and additional guidance and steering mechanisms may inadvertently detract from its ability to focus on the user’s original queries.
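The write-up doesn’t include the test harness itself, but the comparison can be pictured roughly as follows: the same instructions and reference material go once through the Chat Completions API and once through an assistant and thread via the (beta) Assistants API, and the two answers are then scored side by side. The model name, instructions, and the simple polling loop below are illustrative assumptions; in the actual experiment the reference documentation was supplied through each API’s own retrieval mechanism rather than inlined as plain text.

```python
# Rough sketch of sending near-identical instructions and reference material
# through the Chat Completions API and the (beta) Assistants API so the two
# answers can be compared. Model, instructions, and content are illustrative.
import time
from openai import OpenAI

client = OpenAI()

INSTRUCTIONS = (
    "Answer the user's question, using the supplied reference material "
    "as supporting context rather than summarizing it."
)
REFERENCE = "...reference documentation text..."
QUESTION = "How do I schedule a Boomi integration process to run nightly?"
USER_MESSAGE = f"Reference material:\n{REFERENCE}\n\nQuestion: {QUESTION}"

# --- Design A: Chat Completions API ---
chat_resp = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": INSTRUCTIONS},
        {"role": "user", "content": USER_MESSAGE},
    ],
)
chat_answer = chat_resp.choices[0].message.content

# --- Design B: Assistants API (beta), same instructions and content ---
assistant = client.beta.assistants.create(model="gpt-4", instructions=INSTRUCTIONS)
thread = client.beta.threads.create()
client.beta.threads.messages.create(thread_id=thread.id, role="user", content=USER_MESSAGE)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
while run.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(1)  # simple polling; production code would handle errors and timeouts
    run = client.beta.threads.runs.retrieve(run_id=run.id, thread_id=thread.id)
messages = client.beta.threads.messages.list(thread_id=thread.id)
assistant_answer = messages.data[0].content[0].text.value  # newest message first

print("Chat Completions:\n", chat_answer)
print("Assistants API:\n", assistant_answer)
```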
Semantic Search: Synthetic Data vs. Raw Data
The team uncovered another unexpected revelation when comparing synthetic data augmentation with direct semantic searching against raw data. Contrary to initial expectations, searching against raw data chunks yielded comparable, and sometimes better, results than using synthetically generated questions.
The team’s initial suspicion was that those surprising results might be attributed to a flaw in the prompt phrasing behind the custom scoring system. But even after modifying those instructions and re-running the tests, semantic search over raw section content continued to outperform semantic search over synthetic questions.
“We knew rich data was useful, but the revelation was that the specific semantic search design we were using saw better performance from searching larger chunks of information even if the information was intuitively less of a directly close comparison to the structure of the question,” Cappetta said.
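In other words, the two retrieval designs differed only in what the question embedding was compared against: the raw section chunks themselves, or synthetic questions generated from those chunks. A minimal sketch of that comparison might look like the following; the embedding model, sample data, and cosine-similarity ranking are assumptions for illustration, not the team’s actual search design.

```python
# Sketch of the two retrieval designs being compared: ranking raw document
# chunks vs. synthetic questions against the user's question embedding.
# The embedding model, sample data, and scoring are illustrative assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Return one embedding vector per input text."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

def top_k(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank corpus entries by cosine similarity to the query embedding."""
    q = embed([query])[0]
    m = embed(corpus)
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    return [corpus[i] for i in np.argsort(sims)[::-1][:k]]

# Design A: search directly against the raw section chunks.
raw_chunks = [
    "Section 3.2: Scheduling. Processes can be scheduled from deployment settings...",
    "Section 5.1: Connectors. Connectors manage authentication to external systems...",
]

# Design B: search against synthetic questions generated from each chunk
# (a matched question would be mapped back to its source chunk before the
# chunk is handed to the LLM as context).
synthetic_questions = [
    "How do I schedule a process to run on a recurring basis?",
    "How do connectors authenticate to external systems?",
]

question = "How do I make an integration run every night?"
print("Raw-chunk search:", top_k(question, raw_chunks))
print("Synthetic-question search:", top_k(question, synthetic_questions))
```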
Boomi’s team dove deeper with a second layer of testing that examined which designs retrieved the highest quality or “best” content, using both raw data sets and synthetically enhanced data with various chunking and tagging methods.
“What really struck me from this whole project was that the leading models are fairly well established, but the leading methods of loading context are a vast and variable landscape,” Cappetta said.
While the list of fully established and trusted LLMs is growing, it can still largely be counted on two hands. At the same time, there are countless methods to prepare, chunk, tag, enhance, search, and retrieve relevant context to provide to those LLMs, and any of them could end up playing a role in an AI implementation.
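As a small illustration of how variable that landscape is, even the first step, chunking, can be done in structurally different ways, and the choice affects what the semantic search has to work with. The two strategies sketched below (fixed-size windows and section-based splits) are generic examples, not the specific methods Boomi tested.

```python
# Two of the many possible chunking strategies: fixed-size character windows
# with overlap, and splitting on document section headings. The sizes and the
# heading pattern are arbitrary illustrative choices, not Boomi's methods.
import re

def chunk_fixed(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Slide a fixed-size window over the text, overlapping so ideas aren't cut in half."""
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]

def chunk_by_section(text: str) -> list[str]:
    """Split on markdown-style headings so each chunk is one document section."""
    parts = re.split(r"\n(?=#{1,3} )", text)
    return [part.strip() for part in parts if part.strip()]
```

Both approaches produce chunks a semantic search can embed and rank, yet they can retrieve very different context for the same question, which is exactly the variability Cappetta describes.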
Future Directions and Applications
Looking ahead, these findings open up numerous possibilities for organizations striving to find practical applications of AI. The insights gained from this experiment provide a roadmap for leveraging AI more effectively across various industries and orchestrating more meaningful AI activities, from enhancing chatbot interactions to refining search algorithms.
This, Cappetta said, is where Boomi’s expertise shines. While industry giants like OpenAI and Google will continue to lead the way in developing future LLMs, Boomi is uniquely positioned to use its platform to orchestrate larger and more practical use cases.
“We have a wheelhouse, home run scenario where we can orchestrate the larger use cases that include those technologies,” Cappetta said. “I think if Boomi is booming in 2028, it will be because what we have already been doing for decades can now be leveraged as an orchestration layer, involving lots of AI calls and AI use cases to be that sort of orchestration framework.”
The Boomi Innovation Group’s work continues to demonstrate the power of hands-on experimentation and analysis with various AI technologies. As we move deeper into 2024, armed with these new insights, the journey into AI’s vast potential continues, promising to revolutionize how we interact with and benefit from these technologies.
Michael N. Price works as a Program Manager at Boomi. Learn more at Boomi.com
