It’s all about the large language models (for now)
When the average person talks about AI, they’re generally referring to generative AI: those cool little interfaces that let you produce reams of text or a dazzling picture from a simple text prompt. And that’s all well and good. But if you take a peek under the hood of your average generative AI tool (be it OpenAI’s ChatGPT, Google’s Gemini, Anthropic’s Claude, or xAI’s Grok) you’ll see that it is underpinned by what’s known as a large language model - or LLM, as we’ll say from now on. These LLMs are the beating heart of generative AI tools.

But what exactly are they? This is a drastic simplification, but LLMs are deep-learning models that can perform a variety of natural language processing tasks. They ingest vast amounts of data (be it written or visual) and effectively ‘work out’ the relationships between different data points. This allows them to recognise the relationships between words and string words together in the statistically most likely, or most relevant, sequence. They are built on artificial neural networks - architectures loosely inspired by the human brain - which is partly why their output can feel so remarkably human.

As we’ve seen over the past two years, the capabilities of these LLMs - and their attendant generative AI platforms - have been impressive, finding use in myriad industries from sales and marketing through to engineering and healthcare. However, it’s possible that we’re reaching a ‘ceiling’ in LLM development. Which brings me on to my next point…

The limits of LLM development?
Based on my fairly extensive reading of experts in this field, I would suggest that there are signs that the development of LLMs is stalling. Why? Well, there are a number of reasons.

A lack of new training data
LLMs must be trained on data. In the same way that humans are (arguably) born as a tabula rasa and must be educated into fully functional adults, LLMs must be trained on data in order to provide their life-like feedback and responses. In fact, it is becoming clear that performance gains for LLMs come (almost solely) from the data they are trained on. Speaking to TechCrunch, Kyle Lo, a senior applied research scientist at the Allen Institute for AI (AI2), summed up the situation by comparing two new LLMs: Meta’s Llama 3, a text-generating model released earlier this year, outperforms AI2’s own OLMo model despite being architecturally very similar - and Llama 3 was trained on significantly more data, which Lo believes explains its superiority. As another researcher, James Betker of OpenAI, put it: “Trained on the same data set for long enough, pretty much every model converges to the same point”.

When you think about it, this makes sense. If you were to make every school-age pupil in the UK read exactly the same books, you’d end up with a group of people with shared assumptions, thought patterns and social norms. The same is true of LLMs. It’s all about the data.

The thing is, the LLMs of all the major players have now ingested nearly all the world’s data (or at least the entirety of the scrapable Internet). Everything from the great works of literature to obscure manifestos sits within the bowels of the great LLM beasts (which is also why a lot of guardrails have had to be implemented to stop ChatGPT regurgitating 4chan screeds in response to GCSE essay questions). There’s not a great deal of data left to be ingested - which means these LLMs have effectively reached the pinnacle of actualisation (if machines can truly ‘actualise’ in the sense that Maslow meant it).
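The ‘statistically modelled sequence’ point above can be made concrete with a toy example. What follows is a drastic simplification of what an LLM does (a bigram word model rather than a deep neural network, trained on a three-sentence corpus invented for this example), but it shows how next-word prediction falls directly out of the training data:

```python
from collections import Counter, defaultdict

# A toy illustration (not a real LLM): learn next-word statistics
# from a tiny invented corpus, then pick the statistically most
# likely continuation of a given word.
corpus = (
    "the cat sat on the mat . "
    "the cat chased the mouse . "
    "the dog sat on the rug ."
).split()

# Count how often each word follows each other word (bigrams).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the word most frequently observed after `word`."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" - it follows "the" most often
print(predict_next("sat"))  # "on"
```

Note that a second model trained on this same corpus would learn exactly the same statistics and make the same predictions - Betker’s convergence point in miniature.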
It’s for this reason that the big AI players have, over the past six months, engaged in the great ‘training data hunt’. They’re not exactly wearing pith helmets and carrying blunderbusses, but they do appear to be engaging in some fairly frantic efforts to obtain (and/or create) new training data to feed the ever-hungry maws of their LLMs. This has led to a situation where AI leaders like OpenAI are paying hundreds of millions of dollars to obtain new sources of training data from the likes of news publishers, book publishers, stock media libraries and even institutional archives. (Idea - want to make some serious money in the next decade? Become an AI data broker; the market is expected to be worth at least $30 billion by 2035.)

One response to this training data ceiling has been to see whether it’s possible to train LLMs on ‘synthetic data’. In a move which brings to mind something akin to Soylent Green, companies are now turning to synthetic data to train their models. This is - as its name suggests - data which has been created artificially (via decision trees, deep learning, and graphics engines) to mimic the characteristics of real-world data. This isn’t a new development, by the way. Synthetic data has been around for a long time - computer simulations being the archetypal example. However, academics are suggesting that the use of synthetic data to train LLMs may not be feasible in the long run. Why? Because of our next point…

The lessons of Habsburg’s jaw
There is a famous portrait of Charles II, the King of Spain from 1665 to 1700. Painted by Juan Carreño de Miranda, it sits in the Kunsthistorisches Museum in Vienna. It depicts - as you would expect of a royal portrait - a man garbed in finery, with a regal bearing. However, one element stands out to the careful observer: the subject’s prominent lower jaw and generally lopsided face. Charles II wasn’t the only member of his lineage to bear such an unusual jawline (known as mandibular prognathism) - in fact, so many of his relatives had the same feature that it garnered the name the ‘Habsburg jaw’.

Why am I telling this story about a long-deceased monarch in an essay about AI? It has been firmly established that Charles II’s distinct jawline was a result of inbreeding. Charles II had an inbreeding coefficient of 0.25, which is phenomenally high - to put it into context, it is the coefficient you would expect in the offspring of two siblings. The result wasn’t just a prominent jaw, but other health issues including epilepsy, an overly large tongue, infertility, gastrointestinal problems and more.

I raise this point because LLMs are facing their own ‘inbreeding’ problem. To be more specific, they must avoid ingesting data that has been produced by other LLMs. Such recursively generated data can induce what academics term ‘model collapse’ - a situation which has been nicknamed ‘Habsburg AI’.

AI companies now find themselves in something of a conundrum. Having initially been able to mine the Internet for data to train their LLMs, they no longer have this option - what with vast amounts of data online now having been produced by various LLMs. The authors of the Nature study, ‘AI models collapse when trained on recursively generated data’, summed up the situation as follows: “In our work, we demonstrate that training on samples from another generative model can induce a distribution shift, which - over time - causes model collapse.
This in turn causes the model to mis-perceive the underlying learning task. To sustain learning over a long period of time, we need to make sure that access to the original data source is preserved and that further data not generated by LLMs remains available over time. The need to distinguish data generated by LLMs from other data raises questions about the provenance of content that is crawled from the Internet: it is unclear how content generated by LLMs can be tracked at scale”.

That last sentence is the key point. The Internet is essentially ‘dead’ as a source of new data for LLM training. From now on, AI companies will need to rely on institutional (and, most importantly, proprietary) sources of knowledge and data for training purposes. If you have a large source of data that a) you own, and b) has never been uploaded to the Internet - now’s the time to make an approach to an AI company.

AI’s energy quandary
As I’ve written previously, LLMs (and the AI platforms they underpin) consume truly vast amounts of electricity; I’m talking country-sized amounts of power. LLMs use electricity in two stages. The first is the training stage: as Alex de Vries pointed out in his journal article, GPT-3 consumed 1,287 MWh of electricity during training. That’s enough energy to power 390 UK homes for an entire year.

The second stage in which LLMs use energy is the inference stage. This is the energy the LLM uses to provide responses to inputs (i.e. the energy it uses to stay running after training). It has been estimated that ChatGPT consumes 564 MWh per day just to run and provide answers. That’s the same amount of energy that 170 UK homes use in a year - every. single. day. Indulge me, but I just want to restate that: every day, ChatGPT uses the same amount of energy as 170 UK homes use in an entire year (and, in reality, this number is likely to be much higher).

As you can imagine, this all comes at an enormous cost. In fact, it’s so costly that AI firms are racing to find a solution. The current solutions on the table include:
- Bringing shuttered power plants back online. Microsoft, for instance, wants to bring the infamous Three Mile Island nuclear power plant back online.
- Some AI companies - such as Elon Musk’s xAI - are going ‘off grid’ and building their own gas-fired power stations for their AI data centres.
- Other AI pioneers - such as OpenAI’s Sam Altman - are being hugely ambitious and pouring money into nuclear fusion (which, cynics have pointed out, is ‘the energy of the future - and always will be’).
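For the record, the home-equivalence figures above are easy to sanity-check. The quick calculation below assumes the commonly quoted figure of roughly 3.3 MWh of electricity per UK household per year (the exact per-home figure varies by source):

```python
# Sanity-check the article's home-equivalence claims, assuming
# ~3.3 MWh of electricity per UK household per year.
MWH_PER_UK_HOME_PER_YEAR = 3.3

training_mwh = 1287           # GPT-3's training consumption (de Vries)
inference_mwh_per_day = 564   # estimated daily ChatGPT consumption

homes_training = training_mwh / MWH_PER_UK_HOME_PER_YEAR
homes_per_day = inference_mwh_per_day / MWH_PER_UK_HOME_PER_YEAR

print(round(homes_training))  # 390 homes powered for a year
print(round(homes_per_day))   # ~171: a year's use for ~170 homes, every day
```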
What’s next for LLMs in 2025?
Okay, so we’ve seen that LLMs, and the AI tools they power, have made huge progress over the last two years. However, they’re facing a number of challenges. It’s my contention that these challenges will be solved - but LLMs (and the broader AI industry) are going to look very different by the end of this year.

The data solution
How are AI companies going to solve the data issue facing LLMs, then? There are a number of possible solutions. The first is simple brute-force capitalism. OpenAI has just completed the largest venture capital deal of all time: in its latest funding round, the generative AI progenitor raised $6.6 billion at a valuation of $157 billion. That valuation makes OpenAI the third most valuable VC-backed company in the world, surpassed only by SpaceX and ByteDance (the owner of TikTok).

OpenAI now has the funds to buy any data it wants. The company is now in a position, at least in theory, to buy the entire back catalogue of Penguin Random House or the rights to tens of thousands of movies. This isn’t mere idle conjecture, either. Earlier this year Meta (the parent company of Facebook and Instagram) briefly mulled over purchasing Simon & Schuster. In short, AI companies may solve their current dearth of data with cold, hard cash. I won’t discount that these companies may also solve the Habsburg AI problem, but in the near term I suspect they’ll prefer to use their capital to acquire new data sources.

The domain-specific LLM solution
Okay, so we’re onto the crux of this article. There is an emerging type of LLM that can potentially leapfrog the current obstacles: domain-specific LLMs. Domain-specific LLMs are language models which have been trained on a certain knowledge domain; they are explicitly designed to excel within particular areas of expertise. Ask them the sort of general questions that you would put to ChatGPT, and they’ll likely fail. But ask them a question in the domain in which they have been trained, and you’ll get a superb answer. Domain-specific LLMs are already extant and being used ‘in the wild’. Examples include:
- ClimateBERT - an LLM that is trained on climate-related information.
- Med-PaLM 2 - an LLM from Google designed to answer medical questions.
- BloombergGPT - an LLM that has been built and trained for financial tasks, from sentiment analysis through to forecasting.
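In practice, domain-specific models are often deployed alongside a general-purpose model, with each query routed to whichever model fits it best. Here is a minimal sketch of that routing pattern - the model names and keyword lists are entirely illustrative, and real systems typically use a trained classifier (or an LLM itself) rather than keyword matching:

```python
# Toy router: pick a domain-specific "model" based on keywords in
# the query. Model names and keyword lists are illustrative only.
ROUTES = {
    "climate-llm": {"climate", "emissions", "warming"},
    "medical-llm": {"diagnosis", "symptom", "patient"},
    "finance-llm": {"stock", "earnings", "forecast"},
}

def route(query: str) -> str:
    """Return the name of the model that should handle `query`."""
    words = set(query.lower().split())
    for model, keywords in ROUTES.items():
        if words & keywords:  # any domain keyword present?
            return model
    return "general-llm"  # fall back to a general-purpose model

print(route("What are global emissions trends?"))  # climate-llm
print(route("Write me a limerick"))                # general-llm
```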
