Metrics, scenarios and leaderboards: measuring GenAI’s effectiveness as a genuine business tool


What can the C-suite rely on when trying to navigate the emerging GenAI scene?

While 2022 will be remembered as the year ChatGPT was launched, 2023 will be seen as the year that GenAI started to cross-pollinate other digital solutions, and GenAI-powered applications started to proliferate.

In the first few months after ChatGPT’s launch, it was the only conversational AI software based on large language models – which, having learned statistical relationships from texts on the internet, can generate human-like texts themselves. At this stage, people were largely just curious about what ChatGPT could do and occupied mainly with tinkering with the software to explore its limits.

The genie truly came out of the bottle when Meta released its LLaMA family of four different models, which were designed to outperform its chatbot rival both on major benchmarks and in potential for business applications.

The original LLaMA model was created for research purposes only. However, one week after it was announced, the full model, with instructions on how to run it, was leaked on the online message board 4chan.

From that point, the genealogy of LLMs has become increasingly entangled. An entire stable of applications has been developed, such as Vicuna; Stability AI’s answer to ChatGPT; Stanford University’s Alpaca, fine-tuned from a 7B LLaMA model; the Large-scale Artificial Intelligence Open Network’s Open Assistant; and HuggingChat, machine learning developer Hugging Face’s new open-source conversational AI. Without Meta’s LLaMA, none of these applications would have been nearly as good as they were when published.

Evaluating LLM evaluation frameworks

The call to evaluate language models’ capabilities predates ChatGPT’s emergence. The classic version of HELM (Holistic Evaluation of Language Models), designed by Stanford’s Center for Research on Foundation Models, had already been preceded by several iterations when it was published in November 2022, a few weeks before OpenAI’s game-changing software started its breathtaking career.

The goal of the HELM project has been to make GenAI’s decision-making process transparent. Its original, or “classic”, version evaluates accessible language models based on a matrix of core and targeted scenarios – or use-cases – and seven metrics, which include accuracy, robustness, bias and toxicity.

Its 42 scenarios combine traditional natural language processing tasks (question-answering, text summarisation and sentiment analysis) and benchmarks used to test the limits of LLMs’ enhanced capabilities, such as text generation, reasoning and tool use.
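The scenario-by-metric matrix described above can be pictured with a few lines of Python. This is only an illustrative sketch, not HELM's actual code: the scenario and metric lists are a small subset taken from the text, and the function names and scores are invented for the example.

```python
# Illustrative subset only; the real framework has 42 scenarios and seven metrics.
SCENARIOS = ["question_answering", "summarisation", "sentiment_analysis"]
METRICS = ["accuracy", "robustness", "bias", "toxicity"]


def build_matrix(scores):
    """Arrange {(scenario, metric): score} into a scenario-by-metric grid.

    Cells that were never measured stay as None so that gaps remain visible.
    """
    return {
        scenario: {metric: scores.get((scenario, metric)) for metric in METRICS}
        for scenario in SCENARIOS
    }


def coverage(matrix):
    """Share of scenario/metric cells that have actually been measured."""
    cells = [value for row in matrix.values() for value in row.values()]
    return sum(value is not None for value in cells) / len(cells)


# Made-up scores for one imaginary model, purely for illustration.
example_scores = {
    ("question_answering", "accuracy"): 0.71,
    ("question_answering", "robustness"): 0.64,
    ("summarisation", "accuracy"): 0.58,
    ("sentiment_analysis", "toxicity"): 0.02,
}

matrix = build_matrix(example_scores)
print(f"measured cells: {coverage(matrix):.0%}")
```

Keeping the empty cells in the grid, rather than dropping them, is what makes it obvious where a model has not been tested at all.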

Being dynamic, the HELM evaluation framework will be regularly updated with new metrics, scenarios and models – of which there are currently more than 30 across 12 organisations.

However, LLM testing is a fairly resource-intensive affair, as is LLM development. Having realised this, the Center for Research on Foundation Models leveraged the benchmark’s modular nature to introduce HELM Lite, a lightweight-yet-broad benchmarking tool, in December 2023.

The first HELM Lite tool focuses only on capabilities, dropping some functions while using proxies for others, and is to be followed by another evaluation tool for testing LLM safety.

Selecting the LLM best suited to a specific use case is made harder by the fact that different leaderboards, such as HELM, EleutherAI’s Language Model Evaluation Harness or Hugging Face’s Open LLM Leaderboard, may show different rankings for the same group of large language models.

For example, in June 2023, there was an outcry when Falcon, an LLM from the UAE, came top of Hugging Face’s Open LLM Leaderboard. The leaderboard includes a test of models’ knowledge in a wide variety of subjects at various difficulty levels, using multiple-choice questions, and on it Falcon beat LLaMA, which scored lower on the Open LLM Leaderboard than on other leaderboards.

Repeated tests run by researchers showed that the lack of standardised evaluation methods results in such discrepancies in rankings, a problem that needs to be promptly addressed.
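A toy example shows how an evaluation convention alone can reorder a leaderboard. The Python sketch below is not a reconstruction of the Falcon and LLaMA case: the two models, the log-probabilities and the option lengths are all made up. It compares one plausible difference between harnesses, namely whether a multiple-choice answer is chosen by the raw log-probability of each option or by its log-probability per token.

```python
# Each question lists the token length of every answer option
# and the index of the correct option. All values are invented.
QUESTIONS = [
    {"lengths": [2, 6], "correct": 0},
    {"lengths": [3, 3], "correct": 0},
    {"lengths": [2, 6], "correct": 1},
    {"lengths": [5, 2], "correct": 0},
]

# Hypothetical total log-probability each model assigns to each option.
LOGPROBS = {
    "model_alpha": [[-4.0, -6.0], [-3.0, -5.0], [-5.0, -4.0], [-3.0, -4.0]],
    "model_beta": [[-2.0, -9.0], [-3.0, -5.0], [-4.0, -6.0], [-5.0, -4.0]],
}


def pick(logprobs, lengths, normalise):
    """Return the index of the option the model is deemed to have chosen."""
    scores = [lp / n if normalise else lp for lp, n in zip(logprobs, lengths)]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(model, normalise):
    """Fraction of questions answered correctly under one convention."""
    hits = sum(
        pick(logprobs, question["lengths"], normalise) == question["correct"]
        for logprobs, question in zip(LOGPROBS[model], QUESTIONS)
    )
    return hits / len(QUESTIONS)


def leaderboard(normalise):
    """Models sorted from best to worst under one convention."""
    return sorted(LOGPROBS, key=lambda m: accuracy(m, normalise), reverse=True)


print("raw log-probability:     ", leaderboard(normalise=False))
print("per-token log-probability:", leaderboard(normalise=True))
```

With these invented numbers, model_alpha leads under the raw convention and model_beta leads under the per-token convention, even though neither model's outputs have changed.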

There was more buzz around leaderboards last July, when Palmyra, the LLM of San Francisco-based start-up Writer, ranked first in several important tests on HELM. Palmyra X V3, a new iteration of Writer’s latest LLM, has since outpaced even Google’s PaLM 2 on HELM Lite.

An LLM fine-tuned for business use cases

Writer is regarded as the first full-stack generative AI platform built for business. Its large language model, Palmyra, comes in three sizes, and the platform includes an all-purpose foundation model layer and an application layer.

Palmyra appears on HELM Lite as a model pretrained on business writing and marketing data. Businesses can integrate it into their systems and products after some fine-tuning with their own data and style guidelines.

As GenAI gradually evolves from a shiny new tool into a real business solution, the security and ethical aspects of its use will become more central. These models’ alignment with human preferences, intentions and values will be one of the top priorities in their further development.

LLMs can be aligned with real-life needs by feeding them curated rather than unstructured internet data, as in Palmyra’s case, or via reinforcement learning from human feedback. With these approaches, LLMs can get cheaper and faster while still matching the accuracy of the largest proprietary language models trained on unstructured internet data.

To allay businesses’ fears that their sensitive proprietary data will leak out as training data for these more advanced LLMs, vendors’ pitches will increasingly promise that prompts, text and pictures created for and by GenAI won’t leave the business’s walled garden or be used in the way that copyrighted texts found on the internet were used to train current models.

As it’s still early days, it’s hard to tell what the enterprise-grade applications of large language models will be like – whether the majority will be large or small, pretrained or fine-tuned, or how programmers will ensure these models kick the habit of “hallucinating” plausible but fictional answers.

Although LLMs need further improvement to align current models with real-world use cases, no business can afford to ignore them, as they are destined to become table stakes in the medium term.
