We were inspired by the largest open distributed training attempt to date, which trained an open source 10B model [4]. It got us thinking: has there been a community attempt to run LLMs at the largest scale possible? No quantization, just full-precision weights.
This is what we built Kalavai for: to make AI accessible to all. Right now, models of 70B+ parameters are beyond the reach of individual users and small organisations. It is not a matter of performance alone; they simply cannot gather, and then manage, enough resources to run truly fat LLMs. So we thought: if we are to offer value to the community, we must be able to run those models.
To that end, we have set ourselves a task: running a full 405B parameter model (no quantization) on community-owned hardware (i.e. consumer-grade devices) and making it available for public inference. Your H100s are welcome too, but I assume many of us are short of those anyway.
But we cannot do it alone. We need the support of the community we want to serve. Want to be part of it? Join us to register your interest, or jump over to our Discord channel.
If you want to get your hands dirty, head over to our open source repo and get deploying!
Details
Why?
Because alone, we can't. But together, we have more computational power than the biggest cloud in the world. We want to show that this is not only possible, but genuinely practical with consumer devices spread across locations. To demonstrate practicality, we've set our goals high: we want to run the largest open source model available today, and if that wasn't challenging enough, we want to deploy it twice over. Yes, not just one LLaMa 405B, but two replicas.
What model?
Meta's LLaMa 3.1, at 405B parameters the largest model in the family [2].
How many people will it take?
405B parameters at 4 bytes each (32-bit precision) come to roughly 1620GB of weights, or approximately 1944GB per replica once you add the ~20% serving overhead estimated in [1]. The average consumer GPU has around 8GB of vRAM, so raw division says at least 243 devices per replica; we budget 244, a minimum of 488 devices in total (possibly more!), since not every byte of vRAM is usable in practice.
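For the curious, here is a minimal sketch of that arithmetic in Python. The parameter count and ~1.2 overhead factor follow [1]; the 8GB per-device figure is our working assumption for an average consumer card.

```python
import math

PARAMS = 405e9          # LLaMa 3.1 405B
BYTES_PER_PARAM = 4     # fp32 weights, no quantization
OVERHEAD = 1.2          # ~20% extra for activations/KV cache, per [1]
CONSUMER_VRAM_GB = 8    # assumed average consumer GPU (our estimate)

total_gb = PARAMS * BYTES_PER_PARAM * OVERHEAD / 1e9
devices = math.ceil(total_gb / CONSUMER_VRAM_GB)

print(f"~{total_gb:.0f} GB per replica")        # ~1944 GB
print(f">={devices} devices per replica")       # >=243 (we budget 244)
print(f">={2 * devices} devices for 2 replicas")
```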
How?
We'll use Kalavai, and the strength of the community. Kalavai is a platform designed to turn everyday devices into an AI cloud. With enough devices from the community, there is nothing we cannot do.
What hardware do you need to join in?
If you have an NVIDIA GPU* less than 10 years old (GTX 10 series onwards, RTX, Quadro), we need you! We are looking for people with AMD cards too, but we may need to focus on NVIDIA first.
If you have a GPU and want to be part of this, sign up here to register your interest.
* If you happen to have a data centre with T4, L4, P100, V100, P40, A40, A100, H100 or H200 and want to join in, you are also more than welcome!
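Not sure what card you have? If the NVIDIA driver is installed, a quick query to nvidia-smi will tell you the model and vRAM. Below is a small Python sketch of that check; the only assumption is that nvidia-smi is on your PATH.

```python
import subprocess

# Ask the NVIDIA driver for each GPU's name and total memory.
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.strip().splitlines():
    name, memory = (field.strip() for field in line.split(","))
    print(f"{name}: {memory}")  # e.g. "NVIDIA GeForce RTX 3060: 12288 MiB"
```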
What do you need to do?
When the time is right, all you'll need to do is turn your computer on, install our kalavai client and join the cluster. More details on this once we are ready!
What will you get in return?
Eternal kudos, your name enshrined in history (and on our website), and free access to inference for as long as we can keep it running.
Our plan
First we need to cross the T's and dot the I's on our end.
In the meantime, we need enough community support. We literally cannot do this without you, so whilst things are sorted in the backend, you can sign up to show your support and join us on Discord.
Before going to the major leagues, we probably need to play a few lower tournaments :) We will try a few smaller models ahead of the big event, likely the Qwen2.5 14B, 32B and 72B versions, and the Falcon 40B and 180B versions. The sketch below gives a feel for the scale of each step.
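Applying the same back-of-the-envelope formula from above to the warm-up models shows what each step demands. As before, the ~1.2 overhead factor follows [1] and the 8GB per-device figure is our assumption:

```python
import math

OVERHEAD = 1.2          # ~20% serving overhead, per [1]
CONSUMER_VRAM_GB = 8    # assumed average consumer GPU

warmups = {
    "Qwen2.5 14B": 14e9,
    "Qwen2.5 32B": 32e9,
    "Qwen2.5 72B": 72e9,
    "Falcon 40B": 40e9,
    "Falcon 180B": 180e9,
}

for name, params in warmups.items():
    gb = params * 4 * OVERHEAD / 1e9           # fp32, no quantization
    devices = math.ceil(gb / CONSUMER_VRAM_GB)
    print(f"{name}: ~{gb:.0f} GB, >={devices} devices")
```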
More details to come soon!
References
[1] https://www.substratus.ai/blog/llama-3-1-405b-gpu-requirements
[2] https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct