Serverless in AI

Jason Smith

November 30, 2024


Unless you have been living under a rock, you have probably heard about the recent explosion in AI thanks to the launch of ChatGPT two years ago. AI is by no means new, though. Throughout history, literature has given us many examples of non-biological entities "thinking" in one form or another.

The idea of AI in computing didn't really take hold until Alan Turing presented the idea of the thinking machine in the 1950s. The Turing Test was created as a way to tell whether a computer can actually think. In short, a human chats with an unseen entity and tries to determine whether it is an AI or not. If the entity is in fact an AI but the human is convinced it is human, then the machine passes the test.

The general idea behind the test was that computers couldn't really "think" so much as respond with pre-programmed content. Those limitations meant that, over time, it would become obvious that you were not chatting with a human.

With Generative AI, we are certainly closer than we have ever been. I personally don't think we can say that LLMs have definitively passed the Turing Test, but I do think we are headed in that direction, and the next decade or so will be interesting.

But let’s not just talk about AI. Let’s talk about the role that serverless will have in AI.

Serverless Compute in AI

In the past year or so there has been a rise in what many are calling "Inference-as-a-Service". I talk about it briefly in my newsletter, The Cloud Is Serverless. Before we dive into that, let's define what "inferencing" actually is.

When we talk about AI, machine learning, and models, we are usually talking about one of two stages: training and inferencing.

Training is the act of building a model. In short, you collect a LOT of data and then "teach" the model to do something. Let's say you wanted to create a model that identifies the type of flower in a picture. You would have to provide a massive number of pictures and essentially "teach" the model how to distinguish the different flowers. The end product is a model that can identify what flowers are in an image supplied to it.
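
To make the training stage concrete, here is a minimal sketch using scikit-learn's built-in iris flower dataset. It uses flower measurements rather than pictures to keep the example small, and the `flower_model.joblib` file name is just an illustration.

```python
# A minimal training sketch: fit a classifier on scikit-learn's iris
# flower dataset and save the resulting model artifact to disk.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import joblib

# Load the flower data (sepal/petal measurements plus species labels).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# "Teach" the model by fitting it to the training data.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")

# The end product of training is a model artifact you can ship.
joblib.dump(model, "flower_model.joblib")
```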

Inferencing is the process of interacting with the model. In the previous example, this would be running the model against a picture to tell you what flowers (if any) are in the photo. There is often some kind of "inferencing layer": some code and/or API that you interface with, which in turn communicates with the model.
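
That inferencing layer can be as simple as a small API sitting in front of the model. Here is a rough sketch using FastAPI and the model artifact from the training sketch above; the endpoint path and field names are illustrative rather than any particular vendor's API.

```python
# A minimal inferencing-layer sketch: a small API that loads the trained
# model and answers prediction requests on its behalf.
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("flower_model.joblib")  # artifact from the training sketch
SPECIES = ["setosa", "versicolor", "virginica"]

class Measurements(BaseModel):
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float

@app.post("/predict")
def predict(m: Measurements):
    # The inferencing layer translates the request into a model call
    # and the model's output back into an API response.
    features = [[m.sepal_length, m.sepal_width, m.petal_length, m.petal_width]]
    label = int(model.predict(features)[0])
    return {"flower": SPECIES[label]}
```

You would run this with something like `uvicorn app:app` and POST measurements to `/predict`.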

So back to Inference-as-a-Service. Many companies are trying to monetize LLMs and other models by giving their customers a way to inference with them through a simple API call. Now, from an inferencing-layer standpoint, does your customer really need a service that is up 24/7 to interact with?

Serverless compute is perfect for this: it spins up when needed, does its thing, then spins back down. In this architecture, when a customer hits the API endpoint, a serverless worker starts, inferences with the LLM, returns a result, and then spins down.
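
As a sketch of that flow, here is what such a worker might look like as an AWS Lambda-style handler. The `LLM_ENDPOINT` URL and the request/response shape are hypothetical stand-ins for whatever hosted model you actually inference with.

```python
# Sketch of a serverless worker: it only runs while a request is in flight,
# forwards the prompt to a hosted LLM, and returns the result.
import json
import os
import urllib.request

# Hypothetical hosted-LLM endpoint; in practice this would be your
# inference provider's URL and authentication scheme.
LLM_ENDPOINT = os.environ.get("LLM_ENDPOINT", "https://example.com/v1/generate")

def handler(event, context):
    prompt = json.loads(event.get("body", "{}")).get("prompt", "")

    req = urllib.request.Request(
        LLM_ENDPOINT,
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        answer = json.loads(resp.read())

    # Once this returns, the worker spins back down and stops costing money.
    return {"statusCode": 200, "body": json.dumps(answer)}
```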

Many companies, such as HuggingFace, Cloudflare, Vultr, and Alibaba, have begun to offer serverless compute solutions for inferencing. Google Cloud recently announced GPUs for Cloud Run, which can take serverless inferencing to the next level!

Serverless Databases in AI

In recent years, we have seen an explosion of serverless databases. Now, "serverless database" sounds a bit like an oxymoron. Unlike compute, data isn't exactly stateless; you want that data to exist potentially forever. However, many companies have found ways to separate the compute layer of the database from the storage layer.

Traditionally, you would install a database on a VM (or series of VMs), and VMs, by their very nature, are always on. Your data is stateful and needs to be permanent, but does the compute need to be always on? This is what serverless databases attempt to address.

In the age of AI, in particular Generative AI, serverless databases can play a big role. In the past, I have talked about RAG, or "Retrieval Augmented Generation". You may or may not be familiar with RAG, so let's level set.

RAG is an architecture that leverages a vector database to store customer or proprietary data to be used in Generative AI.

At its core, Generative AI leverages a Large Language Model (LLM). It is not cheap to train an LLM, so it is cost prohibitive for most organizations to create their own. While foundational LLMs train on a lot of data, they don't train on ALL data. You may have proprietary data that isn't easily accessible because it's behind a firewall or part of the Deep Web (not to be confused with the Dark Web).

So what do you do if you want to use Generative AI on your own data? It's expensive to train your own LLM, and you aren't going to hand your proprietary data to a big company to train theirs. RAG stores your custom information in a database, and when your AI agent calls out to the LLM, it also checks against that database and gives you an answer relevant to your data.

Companies such as Neon.tech and thenile.dev have begun to offer serverless Postgres with pgvector, and Pinecone offers a serverless vector database of its own. The idea behind this goes hand in hand with the aforementioned Inference-as-a-Service: I only need to interact with the RAG database when I am inferencing with the LLM, so why should I pay for it when it's idle?
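
To make the retrieval side of this concrete, here is a rough sketch of querying serverless Postgres with pgvector. The `documents` table, its `embedding` column, the `DATABASE_URL` environment variable, and the `embed()` helper are all hypothetical; the point is just that the database only does work while an inference is in flight.

```python
# Sketch of the retrieval step in RAG against serverless Postgres + pgvector.
import os
import psycopg2

def embed(text: str) -> list[float]:
    # Hypothetical helper: call whatever embedding model you use and
    # return its vector. Stubbed out here for illustration.
    raise NotImplementedError

def retrieve_context(question: str, top_k: int = 5) -> list[str]:
    query_vector = embed(question)
    vector_literal = "[" + ",".join(str(x) for x in query_vector) + "]"

    # DATABASE_URL is the connection string your serverless Postgres provider gives you.
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    with conn, conn.cursor() as cur:
        # pgvector's <-> operator orders rows by distance to the query vector,
        # so this returns the top_k most similar documents.
        cur.execute(
            "SELECT content FROM documents ORDER BY embedding <-> %s::vector LIMIT %s",
            (vector_literal, top_k),
        )
        rows = [row[0] for row in cur.fetchall()]
    conn.close()
    return rows

# The retrieved snippets get folded into the prompt before inferencing with
# the LLM, so the generated answer reflects your proprietary data.
```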

Serverless Streaming in AI

Streaming is a very important part of modern application architecture. I have often talked about Serverless Eventing on this very blog. Real-time applications receive a lot of data every second, and you need a messaging bus that can handle it in real time.

Traditionally, we have seen tools such as Kafka and RabbitMQ, as well as newer contenders like Pulsar and NATS. Companies such as Confluent (https://confluent.io) have looked to monetize these tools and have seen great success.

However, most of the incumbent offerings do not have a "pay-as-you-go" model. You usually have to provision VMs as workers and pay for them even while they idle.

Enter "serverless streaming". Companies such as Redpanda and StreamNative have arrived in the marketplace offering serverless Kafka and Pulsar, respectively. Granted, they are backwards compatible with the OSS APIs rather than being those products in their "true state", but it still works the same for the end user.

Similar to serverless databases, they separate the storage and compute layers. While you pay for storage long term, you only pay for the compute as it’s being used. This is true serverless streaming.

So where does this fit into the AI story? Well, what if you want to inference using the real-time data coming through? What if you want to use that real-time data to update your vector database? This is all part of the larger architecture (one that I plan on building a demo for in Q1 of 2025, so stay tuned). Streaming is a major component of Generative AI usage within modern applications, and there is no reason you can't leverage serverless architecture for it.
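
As a hedged sketch of that last idea, here is what a small consumer might look like: it reads events from a Kafka-compatible stream (Redpanda, in this assumption) and upserts embeddings into the vector database. The broker address, topic name, and the `embed()`/`upsert_document()` helpers are all hypothetical.

```python
# Sketch: consume real-time events from a Kafka-compatible stream and
# keep the RAG vector database up to date with fresh data.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "my-redpanda-broker:9092",  # hypothetical broker address
    "group.id": "rag-updater",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["customer-events"])  # hypothetical topic

def embed(text: str) -> list[float]:
    # Hypothetical embedding call, same idea as in the RAG sketch above.
    raise NotImplementedError

def upsert_document(content: str, vector: list[float]) -> None:
    # Hypothetical write into the serverless vector database (e.g. pgvector).
    raise NotImplementedError

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        content = msg.value().decode("utf-8")
        # Each new event is embedded and written to the vector database,
        # so the next inference sees up-to-date context.
        upsert_document(content, embed(content))
finally:
    consumer.close()
```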

What’s Next for Serverless AI?

It's still pretty early to say what's next. My general prediction is that we'll see more and more serverless offerings to help. Google Cloud created Cloud Run with GPUs, which lets you deploy and inference with an LLM in under a minute (many times in under 10 seconds), which is unheard of, thanks to their serverless architecture. You pay only for what you use in this scenario.

Historically, we would associate serverless with compute, but companies like Pinecone and Redpanda are showing that data in its many forms can be serverless. While serverless training doesn't exist today, I anticipate that some startup somewhere will figure out how to make it happen within the next 5 or 10 years.

The reality is that serverless is here to stay and is only becoming more ubiquitous in modern enterprise applications. The Cloud Native Computing Foundation (CNCF) lists serverless as one of the five computing trends to watch, and they aren't alone in this assessment. The serverless market is expected to hit $44.7B by 2029.

Needless to say, AI is exploding in an unprecedented manner. So why not combine the two technologies? Why pay for idle workers in AI if you don't have to? Right now, paying for GPUs and stateful compute is a reality for training, but inferencing is a different story.

I am excited to see what new technologies and startups pop-up as a result. If you are interested in learning more about AI in general, here are some recommended AI courses.

One more thing! I want to show some serverless demos and architectures here, so please stay tuned for more right on this blog!

Cover Image Credit to Petar Avramoski on Pexels