The pressing need for LLM error characterization (instead of just LLM proliferation)

Mon, 19 Aug 2024 01:18:08 -0700

Tags: technical


For the last year and a half, the focus in AI has been on LLMs, as they succeed at tasks that are hard for humans. When a human accomplishes feats of this kind, it signals a level of proficiency that transfers to many other tasks.

Proficiency in a task goes hand in hand with the kinds of errors committed when performing it. In a past life, I was a late cofounder of an elementary math education startup. We worked on understanding the erroneous logic behind children's miscalculations in, say, fraction addition. With this understanding we could then explain to the students the error in their logic and steer them toward the correct solution. This was possible because humans err in similar ways (unless a kid is an unusually creative failure innovator). The way an intelligence fails constitutes its error profile.
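
To make "erroneous logic" concrete, here is a minimal sketch of diagnosing one classic fraction-addition misconception (adding numerators and denominators separately). It is only an illustration, not the startup's actual system, and the helper names are made up for this example.

    from fractions import Fraction

    def buggy_fraction_add(a, b):
        """The classic misconception: add numerators and denominators separately."""
        return (a[0] + b[0], a[1] + b[1])

    def diagnose(a, b, student_answer):
        """Label a student's answer to a + b; fractions are (numerator, denominator) tuples."""
        correct = Fraction(*a) + Fraction(*b)
        if Fraction(*student_answer) == correct:
            return "correct"
        if student_answer == buggy_fraction_add(a, b):
            return "added numerators and denominators separately"
        return "unrecognized error"

    # A student answering 1/2 + 1/3 = 2/5 is following recognizable (if wrong) logic,
    # so we can explain the misconception instead of just marking the answer wrong.
    print(diagnose((1, 2), (1, 3), (2, 5)))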

While LLMs show human-comparable proficiency in some tasks, their error profile is very non-human. I have argued in an earlier post that the most exciting and novel insight I gained during the construction of the Jeopardy! Watson system was peeking into its non-human error profile. The same can be said of my work with LLMs so far. It is exciting to live in a time when non-human intelligences (as primitive as these machines are) are starting to appear.

Aside from the excitement, this non-human error profile means our intuitions about LLM behaviour are uncalibrated. This lack of understanding of LLM error profiles has a clear engineering drawback: it makes them very hard to use effectively. Engineering specs for IC components include behaviour diagrams; a tool is useful not only when it succeeds but also when its failures can be kept under control.

Yet building a comprehensive error profile for LLMs is elusive (though not impossible; I cite my favourite paper on the topic at the end). Their parameter space is gargantuan. They are very prone to butterfly effects, where small changes in wording (or just in the temperature parameter) snowball into completely different outputs. And when the outputs are wrong, they defy human logic.
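
As a rough illustration of how one might start probing that sensitivity, here is a sketch that compares outputs across paraphrased prompts and temperatures. It assumes the OpenAI Python client (openai >= 1.0) and an API key in the environment; the model name, prompts, and the crude lexical similarity metric are all placeholders, so swap in whatever client and comparison you actually use.

    import difflib
    from itertools import product
    from openai import OpenAI

    client = OpenAI()

    def generate(prompt: str, temperature: float) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder; any chat model will do
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return response.choices[0].message.content

    def similarity(a: str, b: str) -> float:
        # Crude lexical similarity in [0, 1]; 1.0 means identical outputs.
        return difflib.SequenceMatcher(None, a, b).ratio()

    # Two prompts a human would read as the same request.
    prompts = [
        "List three risks of deploying LLMs in production.",
        "Name three risks of using LLMs in production systems.",
    ]
    temperatures = [0.0, 0.7]
    outputs = {(p, t): generate(p, t) for p, t in product(prompts, temperatures)}

    # Large similarity drops for tiny input changes are the butterfly effect described above.
    keys = list(outputs)
    for i, k1 in enumerate(keys):
        for k2 in keys[i + 1:]:
            print(k1, "vs", k2, "->", round(similarity(outputs[k1], outputs[k2]), 2))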

This problem is exacerbated by the fact that LLMs themselves are a moving target. The error profiles of, say, different versions of GPT-4 are different enough that people claim "it's getting worse" (maybe it is not getting worse; maybe it is getting different enough that it leads to worse outcomes in some tasks, and some people notice and complain). This is compounded by vendors retiring LLMs very quickly, making investments in understanding and containing a particular error profile moot. Instead, the narrative is to move to a newer offering that will have less intrinsic error. This game of three-card monte goes against good engineering practices but makes short-term marketing sense: the product cannot be faulted because it is always "new and improved" (customers will be patient with that game for only a limited time).

To wrap up, in their article in Communications of the ACM, June 2011, "Computer Science Can Use More Science", Profs. Morrison and Snodgrass argued that:

Software developers should use empirical methods to analyze their designs to predict how working systems will behave.

and that

Just because we design and build computational systems does not mean we understand them; special care and much more work is needed to correctly characterize computation as a scientific endeavor.

Characterizing the error profile of LLMs is scientifically possible. For a great example, take a look at the TGEA paper: "An Error-Annotated Dataset and Benchmark Tasks for Text Generation from Pre-trained Language Models". The researchers generated more than 40,000 texts using GPT-2 and then evaluated them by hand against a set of error categories derived, together with linguists, from analysing the outputs. I am looking forward to more work on characterizing error profiles so we can properly use LLMs as robust building blocks for Gen AI applications.
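
For a sense of what such a generate-then-annotate workflow looks like in code, here is a loose sketch. It assumes the Hugging Face transformers library, and the error categories and prompts are illustrative placeholders, not the TGEA taxonomy or the paper's actual pipeline.

    import csv
    from transformers import pipeline, set_seed

    # Placeholder labels for human annotators; TGEA defines its own, richer taxonomy.
    ERROR_CATEGORIES = ["none", "factual_error", "incoherent_continuation",
                        "repetition", "other"]

    set_seed(42)
    generator = pipeline("text-generation", model="gpt2")

    prompts = [
        "The capital of France is",
        "To add two fractions, first",
    ]

    rows = []
    for prompt in prompts:
        for sample in generator(prompt, max_new_tokens=40,
                                num_return_sequences=3, do_sample=True):
            # The error_category column is left blank for human annotators to fill in.
            rows.append({"prompt": prompt,
                         "generated_text": sample["generated_text"],
                         "error_category": ""})

    with open("to_annotate.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["prompt", "generated_text",
                                               "error_category"])
        writer.writeheader()
        writer.writerows(rows)

    print(f"Wrote {len(rows)} generations; allowed labels: {ERROR_CATEGORIES}")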

TL;DR: less LLM hype and more science will lead to better engineering.
