
Expanding AI Infra beyond Nvidia with Nikhil Sonti - Felafax, ex-Meta

There's an entire ecosystem of GPUs beyond Nvidia and Felafax is making developer tooling to make it easy to use them

Nikhil Sonti (along with his brother Nitin Sonti) is the co-founder of Felafax - a platform to train AI models on non-NVIDIA chipsets. They’re part of the ongoing YCombinator batch (YC S24). I’ve known Nikhil for a few months now - a former Meta engineer who’s worked on distributed systems - and I’ve been fascinated by his journey. When I first met Nikhil he was working on another side project, and he has since moved on to the current idea, which seems super interesting.

I talk to him below about the non-Nvidia GPU landscape, his view on model size, and the AI infra ecosystem. Snippets from the podcast are below.

The non Nvidia landscape

There are already a number of accelerators that are non-NVIDIA. Google has a famous one called Google TPUs. AWS is now starting with their own chip called AWS Trainium. And then there are other players: AMD has their own GPUs, Intel has their own GPUs.

Opportunity to build great developer tooling for this ecosystem

NVIDIA has over a decade of close integration with PyTorch through its CUDA kernels, so the performance per dollar there is really good.

And if you just take any other chip like TPU or Trainium and try to run the same PyTorch code, it's not that efficient. There are so many challenges to just getting it up and running because there's no CUDA API.

And yeah, where we fit in is that we're trying to make it easier to run the popular LLM use cases - fine-tuning, training, and inference - on these non-NVIDIA chipsets, building the entire stack.


Why would somebody use non-NVIDIA chipsets?

NVIDIA is amazing. What they've built so far is really valuable. But looking five years down the line, we can't have the entire LLM infrastructure running on one single chipset.

We have been talking with all these other players, like AMD and Google TPUs.

There are amazing chips out there, and most people don't want to put all their eggs in one basket and rely heavily on just Nvidia GPUs. And second, getting availability of these chips is incredibly hard. If you want quota for 100 Nvidia H100s, good luck finding that in the big clouds. So the alternative is: can you use these other chipsets, which are actually 50% cheaper for what you want to do, and where much more capacity is available? There's a lot of demand around this.

There is very little info around fine-tuning on these chips. So one big value proposition from our side, where we want to differentiate, is: hey, you can easily go use this 50% cheaper resource to fine-tune a large model and derive much better value for your customers than just relying on these smaller models.

How does Felafax help?

We built an entire MLOps stack so that you can quickly come in and select, "I want 256 TPU cores and I want to fine-tune Llama 3." It spins all of that up, you get a Python notebook within a few minutes, and you can go fine-tune.
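To be clear, this isn't Felafax's actual API - it's just the kind of sanity check you might run first in a TPU-backed Python notebook, assuming JAX is installed with TPU support, to confirm the cores are actually visible:

```python
# Quick sanity check inside a TPU-backed notebook (JAX with TPU support assumed).
import jax

print(jax.device_count())   # number of TPU cores visible to this process
print(jax.devices()[:4])    # e.g. [TpuDevice(id=0, ...), TpuDevice(id=1, ...), ...]
```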

If you use this, you will be 2x more cost efficient. If someone has a regular fine-tuning or continuous pre-training use case, it's much cheaper. The second differentiation we're thinking about is that you can deploy this in-house, in your own VPC on AWS or GCP, and take the benefits of the lower cost along with the performance.

We think we can match the performance of NVIDIA. Secondly, access to GPUs: even on the big cloud providers it's still hard to get as many GPUs as you want, but these other chips are much more easily accessible.

Big Models or Small Models

I think one big trade-off most people make today is between large models and small models. Many go with the 8 billion parameter Llama 3 instead of the 70B or the 405B, because an 8 billion model is much easier to fit on a GPU. You don't need to orchestrate multi-GPU training and all that, which is much more complex engineering.
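To make the "fits on a GPU" point concrete, here's a rough back-of-envelope for weight memory alone in bf16 (2 bytes per parameter); training adds optimizer state, gradients, and activations on top of this:

```python
# Rough weight-memory estimate in bf16 (2 bytes per parameter).
# Optimizer state, gradients, and activations add several times more during training.
for params_b in (8, 70, 405):
    print(f"Llama 3 {params_b}B -> ~{params_b * 2} GB of weights")
# 8B (~16 GB) fits on a single 80 GB GPU; 70B (~140 GB) and 405B (~810 GB)
# need sharding across many devices.
```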

What about fine tuning smaller models instead?

I mean, that seems to be the thesis right now, but people are interested in these large models and are trying to figure out where they can fit in, because models like Llama 405B are very close to GPT-4, and they're looking at creative ways to include them in their own use cases - especially healthcare companies.

I think there are some use cases where moving to the 405B would dramatically increase the reasoning capabilities of your model. And second, the other use case is using the 405B to train a smaller model - the student-teacher sort of architecture. I think that's one use case which is now developing.
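For the student-teacher idea, here's a minimal sketch of a soft-label distillation loss (written in PyTorch; the function name and temperature value are illustrative, not anything from Felafax):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's softened token distributions."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # kl_div expects log-probabilities for the input and probabilities for the target.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```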

How did Felafax come about?

Yeah, I think huge credit for this idea goes to my brother. He spent years at these large tech companies building exactly this kind of distributed training infra within those teams. Over time he thought, okay, there should be easier ways to do this on other hardware too. And we had this kind of aha moment where we thought: hey, why aren't people using other chipsets, and is there something here we should explore?

What further customer development did you end up doing?

I think over the months we ended up refining it more and more. Our initial thing was, we'll build distributed training infra. That was very generic, but from talking to various customers, the two biggest use cases that exist today are fine-tuning and inference.

Yeah, I think it's people who already have deployments in clouds like GCP or AWS and are looking for their own VPC deployment of LLM infra for open-source models like Llama 3 or Gemini. They want to fine-tune and deploy their fine-tuned models. These are the people we are initially targeting, because since they're fine-tuning and deploying, they need infra that runs in their own cloud. And also the cost part makes a difference: if you're fine-tuning frequently and it's 50% cheaper to use the same provider's TPUs on Google Cloud instead of NVIDIA, there's a big value-add there. So those are the people we are initially trying to target.

Will Nvidia ever make their software portable?

I doubt they'll ever do that, because that's their single biggest differentiator against any other GPU provider. And PyTorch is so well integrated with CUDA that the performance there is amazing. There is an alternative to CUDA called XLA, which is a compiler. All these non-NVIDIA chips support XLA and you can build on top of it - you can use PyTorch XLA to run on non-NVIDIA hardware - but the performance is still not there. PyTorch XLA is in its nascent days, whereas CUDA has amazing integration.

Doesn’t PyTorch work on TPUs? What’s Felafax’s unique value?

Yeah, regarding PyTorch supporting TPUs - it's on paper. Yes, PyTorch XLA runs on TPUs, but just getting it up and running on a TPU is challenging: the right dependencies, the right modules and so on. And even once you get it up and running, the performance today is not at par with just running PyTorch on Nvidia.
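For a sense of what "getting it up and running" involves, here's roughly the minimal PyTorch/XLA boilerplate on a TPU VM - a sketch that assumes matching torch and torch_xla versions are already installed, which is itself the finicky part:

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                       # a TPU core instead of "cuda:0"
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(8, 1024, device=device)
loss = model(x).sum()
loss.backward()
xm.optimizer_step(optimizer)                   # steps the optimizer and flushes the lazy XLA graph
```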

So for most people, it's just easier to take their model and run it on Nvidia. There are so many more resources out there.

So our differentiator is going to be: first, we want to make PyTorch XLA work well. And second, we are building on top of this other framework called JAX, which is much more closely integrated with TPUs and the entire XLA stack, and which should give us a differentiator in terms of performance there.
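As a flavour of why JAX maps naturally onto XLA backends, here's a tiny example: jit-compile a loss and its gradient once, and the same code runs on TPU, GPU, or CPU depending on what jax.devices() reports. (A toy linear model, not anything from Felafax's stack.)

```python
import jax
import jax.numpy as jnp

@jax.jit
def loss_fn(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

grad_fn = jax.jit(jax.grad(loss_fn))          # XLA compiles the gradient too

key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (512, 1))
x = jax.random.normal(key, (64, 512))
y = jnp.zeros((64, 1))

print(jax.devices())                          # TpuDevice entries on a TPU VM
print(loss_fn(w, x, y), grad_fn(w, x, y).shape)
```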

Where does OSS fit into your go-to-market?

Our main thesis for open source was that we wanted to make it easier for anyone to just run this. We spent two to three weeks just trying to get some PyTorch models running on XLA hardware.

So our goal was that it should be super easy for anyone who wants to try out non-NVIDIA hardware to come and use it, spin up their own TPUs, and run it.

But if they're looking for an even simpler solution where we orchestrate everything, they can use the cloud offering, which takes care of and abstracts away even more of the challenges. You just come use it to spin up the TPU and run the entire notebook.

And also there are so many resources for open source and developer support. I think being open source and engaging with developers makes it easier to build out the vision and get their feedback early on.

Startups should try to get more bang for their buck when it comes to GPUs

Many of these startups get like half a million in free credits, and at that point you're mostly not thinking about Nvidia versus other resources. But eventually those credits will run out and you'll start paying with real dollars.

The regular VM instances - some 250k or 300k of credits - last a long time. But GPUs are super expensive, and to do fine-tuning you need to run multiple experiments and then run inference. These are really expensive use cases.

So the credits don't last as long as you initially plan for. It's always good to have diverse options in the back of your mind and to see if there is a much cheaper way to do it.

What is it like to think of an idea while you're still working? Approach one is: I'll just quit my job and then figure out what I want to do. Yours was a different approach.

I would always recommend - there's no perfect timing. I was interested in building something, and even my wife was. So we just started during our free time: think of the biggest problems that users have, figure out who to talk to, go to meetups - all of these are quite easy things to do when you're already working. You have some free time and meetings where you can go interact with amazing people around you and learn. I would say start off there. I think one thing with tech is that building is the easiest part - anyone can go build something in two or three weeks. The hardest part is selling it to users and finding a problem which people really want solved. So go figure out all these things, and once you nail an idea which you think is really valuable, go pursue it full time.

So had you guys not gotten into YC, how do you view that alternate universe? What would have happened?

I think we would have still explored this idea, maybe on a much slower timescale, where we would have tried to talk to people who want this solved, similarly started an open source repo, begun providing value, and figured out whether this is a problem that needs solving. I think YC just made us both pursue this full time and accelerated the process, where you can dedicate all seven days a week.

So if you’re an engineer looking to get started on TPUs - give Felafax a try and keep following their journey

Note: If you’re an Indian operator in Europe thinking of starting up or an Indian founder in Europe - I’d love to chat and learn of your journey
