TL;DR: Models are growing faster than hardware is improving, so the only way to keep up is to throw many (really, many!) GPUs at the problem. And things are only going to get worse. Whereas not long ago one might have frowned at the prospect of distributed computing, today it is a defining feature of the AI landscape. In this new paradigm it is important to widen the doors to foster innovation.
Current computing trends
In the last decade we have seen steady growth in model size [4], averaging 4.1x per year. Behind this trend is what has been coined the scaling laws of Large Language Models (LLMs) [12]. In layman's terms: the bigger the model (and the more data it is trained on, which demands more compute), the better the performance. These empirical laws make performance predictable given the amount of data, compute and model size [12]. Companies chasing state-of-the-art performance therefore follow this law, upping the parameter count of their models. Figure 1 shows how Meta, OpenAI and Google have followed this trend.
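To make the power-law nature of these laws concrete, here is a minimal sketch of the model-size scaling law from [12]; the exponent and constant are approximate values reported in that paper, and real models deviate from the fit.

```python
# Sketch of the model-size scaling law from [12]: test loss falls as a
# power law of parameter count, L(N) = (N_c / N) ** alpha_N.
# The constants below are approximate values reported in the paper.

ALPHA_N = 0.076   # power-law exponent for model size
N_C = 8.8e13      # characteristic (non-embedding) parameter count

def predicted_loss(n_params: float) -> float:
    """Approximate test loss (nats/token) predicted for a model of n_params."""
    return (N_C / n_params) ** ALPHA_N

for n in (1e9, 1e10, 1e11, 1e12):
    print(f"{n:.0e} params -> predicted loss ~{predicted_loss(n):.2f}")
```

Each 10x increase in parameters buys a modest but reliable drop in predicted loss, which is exactly why parameter counts keep climbing.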
At the same time, although GPUs continue to improve, they do so at a steady but slow rate (approximately 30% year on year) [5]. As a result, the gap between what models demand and what the hardware can deliver widens every year.
Distributed computing is no longer an option but a must: the skyrocketing compute demands of AI models mean that multi-node (multiple machine) computing is needed to meet training requirements. AI demand cannot wait for GPUs to get big enough or fast enough. As a result, organisations are throwing thousands of GPUs at the problem to feed the hungry LLM scaling laws and get the best performance possible.
We already live in the future
LLMs requiring multiple machines are already commonplace. Companies such as Meta and xAI have made heavy infrastructure investments to train their LLMs. Meta reportedly used 24,000 GPUs to train the latest LLaMa-3.2 models [6], compared with the 20,000 needed by Grok 2 [7].
And they are not stopping there. Meta plans to grow its infrastructure to 600,000 GPUs [8], whilst xAI has announced its intention to double its GPU count to 200,000 [9]. These are large bets on a future in which AI requires vast numbers of computing devices, evidence that large tech companies see these trends as more than a neat description of the past.
What if I just want to run pretrained models?
If scaling laws affected only the training of foundation models, what would be the big deal? Maybe it turns all of us into users of the big tech companies that can afford to build them, but we still get to use the models, right? We benefit from their tech grinding. Putting aside the business and technological dependencies this creates, the problem is that scaling laws spill over into inference too: larger models perform better, but by virtue of being larger they require more hardware to run. The entire life cycle of AI workflows is affected by this scaling.
Take LLM inference. To help visualise the impact of the widening gap, consider the inference requirements of a smaller version of LLaMa-3 with 30 billion parameters. Today this model fits in 2 GPUs (your mileage may vary). Following both the model growth and hardware improvement rates above, the equivalent model in 5 years will require around 550 GPUs. That's a 275x multiplier!
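For the curious, that figure comes from a simple compound-growth calculation. The sketch below assumes roughly 4x yearly model growth and ~30% yearly hardware improvement (slightly rounded from the rates cited earlier), so treat the output as an order-of-magnitude estimate rather than a forecast.

```python
# Back-of-the-envelope projection: GPUs needed in 5 years for the "same tier"
# model, if models keep growing ~4x/year while per-GPU capability improves
# ~30%/year. Both rates are approximations; different assumptions shift the result.

MODEL_GROWTH = 4.0   # yearly model-size multiplier (approx.)
HW_GROWTH = 1.3      # yearly hardware improvement (approx.)
YEARS = 5
GPUS_TODAY = 2       # a ~30B-parameter model roughly fits on 2 GPUs today

gap = (MODEL_GROWTH / HW_GROWTH) ** YEARS
print(f"Gap multiplier after {YEARS} years: ~{gap:.0f}x")   # -> ~276x
print(f"GPUs needed: ~{GPUS_TODAY * gap:.0f}")              # -> ~552, i.e. the ~550 quoted above
```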
So scaling laws are not just a training concern; they affect inference as well. Whether you are running inference on large data silos, searching for the right training and fine-tuning hyperparameters, or deploying multiple AI agents, chances are you'll need a hardware fleet.
Quantized models don't require as many resources, right?
Sure, quantized models make running large LLMs more affordable by reducing their effective size. However, as shown in Figure 4 [14], scale still matters even for quantized models: the bigger the model, the better the quality. So there is still an incentive to make models as large as possible.
Furthermore, a model can only be quantized if it has been built and trained at scale first, so quantization does not remove the issue.
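As a rough illustration of why scale still matters, the sketch below compares the raw weight footprint of a few illustrative model sizes at 16-bit and 4-bit precision; these numbers ignore KV cache and runtime overhead, which only add to the bill. Even at 4 bits, the largest models do not fit on a single 80 GB GPU.

```python
# Approximate memory footprint of the model weights alone at two precisions.
# Real deployments also need KV cache, activations and framework overhead.

GIB = 1024 ** 3

def weight_gib(n_params: float, bits_per_param: int) -> float:
    """Size of the weights in GiB at the given precision."""
    return n_params * bits_per_param / 8 / GIB

for n_params, label in ((8e9, "8B"), (70e9, "70B"), (405e9, "405B")):
    print(f"{label}: fp16 ~{weight_gib(n_params, 16):.0f} GiB, "
          f"4-bit ~{weight_gib(n_params, 4):.0f} GiB")
```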
Innovation, the victim of the scaling laws
Who is the victim of this rat race for model size? Everyone. Everyone wanting to deploy LLMs and other large models, that is, but particularly researchers and small and medium businesses that lack the headcount for a large engineering team and the capacity to procure hardware. With high-end GPUs averaging $30,000 apiece, not many can afford to own the required infrastructure. Renting is not cheap in the long run either, with estimates for running LLaMa 3.1 at around $1M per year [16]. More and more are relying on service subscriptions as a short-term affordable way into AI, but this means playing by the providers' rules and seriously undermines the privacy of critical business data.
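To put the rental figure in perspective, here is a minimal cost sketch; the GPU count and hourly rates are hypothetical placeholders, since actual cloud pricing varies widely by provider, region and commitment.

```python
# Back-of-the-envelope yearly cost of renting GPUs around the clock.
# GPU count and hourly rates are hypothetical; real pricing varies widely.

HOURS_PER_YEAR = 24 * 365

def yearly_cost(num_gpus: int, usd_per_gpu_hour: float) -> float:
    return num_gpus * usd_per_gpu_hour * HOURS_PER_YEAR

# e.g. a couple of 8-GPU nodes at assumed rates of $3-$7 per GPU-hour
for rate in (3.0, 5.0, 7.0):
    print(f"16 GPUs @ ${rate:.0f}/GPU-hour: ~${yearly_cost(16, rate):,.0f}/year")
```

At the upper end of those assumptions, the bill lands in the same order of magnitude as the ~$1M/year estimate in [16].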
Opportunity
In this landscape of runaway scaling, access to hardware is going to be a strategic advantage, not just for business but also for fostering innovation. Cloud providers and data centres are rushing to buy H100s and H200s because they know there is strong demand for GPU compute, but many customers will be either priced out or deprioritised in favour of bigger spenders. Companies and researchers should de-risk their reliance on any single provider to avoid vendor lock-in and deprioritisation.
Paradoxically, whilst high-end GPUs are in high demand and often in short supply, NVIDIA continues to set sales records with consumer GPUs. Many of these are perfectly capable of running heavy AI workloads, yet they often lie unused in their owners' homes. They also offer better bang for the buck, with FLOPS per dollar improving faster than their enterprise counterparts [1]. A quick estimate from Steam usage stats [15] puts the number of RTX cards at over 57 million, and that's just on Steam. A revolution waiting to happen, putting the capital D in Distributed computing.
More to come!
The new distributed paradigm, whether in a single data centre or truly distributed, brings new challenges at scale:
- Communication overhead (see the sketch after this list)
- Reliability and resilience to failures
- Technological tailwinds pushing distributed computing [2, 3, 10, 11]
- Heterogeneity of resources
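To give a flavour of the first challenge, here is a rough estimate of a single gradient synchronisation using a ring all-reduce; the model size, gradient precision, interconnect bandwidth and worker count are assumptions for illustration only.

```python
# Rough time for one ring all-reduce of the gradients: each worker transfers
# about 2 * (N - 1) / N times the gradient payload, so for large N the cost
# approaches 2 * payload / bandwidth regardless of worker count.
# All inputs below are illustrative assumptions.

def allreduce_seconds(n_params: float, bytes_per_grad: int,
                      bandwidth_gb_s: float, num_workers: int) -> float:
    payload_gb = n_params * bytes_per_grad / 1e9
    return 2 * (num_workers - 1) / num_workers * payload_gb / bandwidth_gb_s

# e.g. 30B parameters, fp16 gradients, 25 GB/s effective inter-node bandwidth
print(f"~{allreduce_seconds(30e9, 2, 25.0, 64):.1f} s per synchronisation")
```

Multiply that by thousands of optimisation steps and the network quickly becomes as important as the GPUs themselves.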
In upcoming posts we'll discuss some of these in more detail.
Want to participate in the conversation? Head over to our community and share your thoughts with us.
References
[1] https://epochai.org/blog/trends-in-gpu-price-performance
[3] https://incompliancemag.com/the-6g-future-how-6g-will-transform-our-lives/
[4] https://epochai.org/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year
[5] https://epochai.org/blog/trends-in-machine-learning-hardware
[8] https://www.theverge.com/2024/1/18/24042354/mark-zuckerberg-meta-agi-reorg-interview
[9] https://uk.pcmag.com/ai/154179/musks-xai-supercomputer-goes-online-with-100000-nvidia-gpus
[10] https://arxiv.org/abs/2312.08361
[11] https://arxiv.org/abs/2311.08105
[12] https://arxiv.org/abs/2001.08361
[14] https://github.com/ggerganov/llama.cpp/pull/1684
[15] https://store.steampowered.com/hwsurvey/Steam-Hardware-Software-Survey-Welcome-to-Steam
[16] https://www.linkedin.com/pulse/million-dollar-trick-llama-31-free-own-costly-run-mazen-lahham-errwf/
Kalavai for accessible distributed computing
We believe distributed computing is not only the present of AI, but the future of computing in general. We want to pave the way to make sure everyone gets access to effective compute, the key resource for AI.
Try our open source, free platform now.