
AI at Scale: Managing Cloud Multi-Tenant AI Infrastructure with Temporal

Every organization is searching for ways to improve their applications by utilizing large language models, or LLMs. Configured correctly and connected to OCR or image/audio generation tools, LLMs can solve very complex problems and further improve how we interact with software systems.

However, leveraging those features has costs, and they can climb quickly.  

When building integrated applications like these, there are a lot of moving pieces to account for, including different providers and systems with different cost models. For example, when using retrieval-augmented generation (RAG), we need to consider the costs of storage, document analysis, and vector store usage. We also need to keep track of tokens used, as that number can grow rapidly when complex multi-step RAG agents are in play.

Those problems multiply when we talk about multi-tenant solutions with complex deployment models and flexible infrastructure. How do we retain the ability to replace any service or model with an equivalent from another vendor?

At TechFabric, we tried different approaches to infrastructure management and found one that we feel strongly about and that best suits our needs.

Let’s review that approach and share our experience and findings.

First, let's outline what infrastructure components a typical application needs:

  1. Application hosting
  2. Blob Storage
  3. Vector store
  4. OpenAI service
  5. Document Intelligence  
  6. Database
  7. Observability platform (Kibana/ApplicationInsights/Langfuse)
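
To make the tiers described below concrete, here is one way a tenant's resource map could be modeled in application code. This is only a rough sketch with illustrative names, not our actual configuration schema; application hosting itself lives on the shared platform cluster described in the next section.

```csharp
using System;

// Hypothetical model of the per-tenant wiring for components 2-7 above.
public enum InfrastructureTier { Shared, TenantSpecific }

public sealed record TenantInfrastructure(
    string TenantId,
    InfrastructureTier Tier,
    Uri BlobStorage,             // 2. blob storage container
    Uri VectorStore,             // 3. vector store (e.g. an Azure AI Search endpoint)
    Uri OpenAIEndpoint,          // 4. OpenAI service
    Uri DocumentIntelligence,    // 5. document intelligence
    string DatabaseConnection,   // 6. database (a secret reference in practice)
    Uri Observability);          // 7. Kibana / Application Insights / Langfuse
```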

In a multi-tenant environment, we can't simply create a new set of resources for every tenant. There are variables to consider and review with the business. Some organizations would prefer lower costs over a secure and isolated environment. Some would prefer to pay more to ensure isolation so none of their resources are shared with other tenants.

We strive to cover the majority of scenarios and edge cases to make all our applications available to the widest possible audience. With that goal in mind, we planned, built, and tested the following hybrid approach to hosting multi-tenant applications.

Here’s what we found after multiple deployments.

Hybrid Infrastructure

To make the infrastructure as cost-effective as possible, we decided to have a multi-model, multi-cloud infrastructure management solution.

At the core of the system is the so-called "Platform" module: a Kubernetes cluster that runs our main application and controls all other subsystems.

Then, we have multiple "Infrastructure Tiers" based on the client's needs:

Shared Resources Tier

This tier is dedicated to users who prefer lower costs and are okay with their information being hosted on the same resources as other clients. Of course, data is still separated at the application level: each client has its own keys, and data is stored in separate folders (in blob storage) or separate indexes (in Azure AI Search).
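
A minimal sketch of what that application-level isolation can look like; the naming convention here is illustrative, not our production scheme.

```csharp
using System;

public static class TenantIsolation
{
    // Each tenant gets its own blob "folder" (prefix) on the shared storage account...
    public static string BlobPrefix(Guid tenantId) => $"tenants/{tenantId}/";

    // ...and its own index on the shared Azure AI Search instance.
    public static string SearchIndexName(Guid tenantId) => $"tenant-{tenantId:N}-documents";
}
```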

A shared tier does not mean that the application has only one vector store or one database. Every cloud has quotas or limits for each instance of a particular service. For example, a single Azure AI Search instance supports only a certain number of indexes, even on its higher tiers, and it is not possible to go above that limit.

That means that in the shared tier, every resource type has to be monitored for approaching its allowed limits. When we reach, say, 70% of the limit, we have to deploy a new resource, record it in the database, and make sure all new registrations use this new resource. This technique is sometimes called partitioning, but we try to avoid the term because it is easy to confuse with service-level partitioning in Azure AI Search, databases, and so on.
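
A hedged sketch of that capacity check; the resource shape, threshold, and method names are illustrative rather than our actual schema.

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed record SharedSearchResource(string Id, int UsedIndexes, int MaxIndexes);

public static class CapacityMonitor
{
    private const double Threshold = 0.70; // provision a new instance at ~70% of the quota

    public static bool NearLimit(SharedSearchResource resource) =>
        (double)resource.UsedIndexes / resource.MaxIndexes >= Threshold;

    // If every existing instance is near its limit, a new one gets deployed
    // (e.g. from a Bicep template), recorded in the database, and marked as the
    // target for all new tenant registrations.
    public static bool NeedsNewInstance(IEnumerable<SharedSearchResource> resources) =>
        resources.All(NearLimit);
}
```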

In addition to keeping multiple copies of shared resources, we also need to monitor how existing resources are used: most of the space may be occupied by tenants who are no longer active, and we want to move their data to cheaper, less performant resources.

So we can identify two infrastructure-related tasks:

  • Deploying new resources automatically based on metrics
  • Moving data between multiple instances or resources.
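
The second task can be as simple as a server-side copy of a tenant's blobs from a "hot" shared account to a cheaper one. Here is a rough sketch using the Azure.Storage.Blobs SDK; it assumes the destination can read the source URIs (e.g. via SAS or a managed identity), which real code has to handle, and it skips verification and cleanup.

```csharp
using System.Threading.Tasks;
using Azure.Storage.Blobs;

public static class TenantDataMover
{
    public static async Task MoveBlobsAsync(
        BlobContainerClient source, BlobContainerClient destination, string tenantPrefix)
    {
        await foreach (var blob in source.GetBlobsAsync(prefix: tenantPrefix))
        {
            var from = source.GetBlobClient(blob.Name);
            var to = destination.GetBlobClient(blob.Name);
            await to.StartCopyFromUriAsync(from.Uri); // server-side copy
        }
        // Deleting the source blobs would follow once the copies are verified.
    }
}
```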

Tenant-Specific Resources Tier

The tenant-specific tier is for tenants that require high security standards and need their data to live in a separate environment. This means that every time we onboard a new tenant, we need to deploy a full set of required services, from blob storage to observability tools. On top of that, we need to allow customers to move from the shared hosting model to an isolated one without losing their data.

While the shared-resources case is just about deploying a new instance of a resource and adding it to the database so the application can use it, the deployment model for tenant-specific resources is drastically more complex.

We need to deploy the whole infrastructure, which means it can't be done just by executing a Bicep or CloudFormation template. Here we need not only the resources but also Langfuse and Kibana deployed to the Kubernetes cluster via Helm, a Kubernetes ingress configured, and deployment pipelines set up with tenant-specific access keys and connection strings. The process has multiple steps, some of which take more than an hour to complete, and it is very error-prone when run from a developer's laptop.

Our previously defined list of infrastructure tasks can now be adjusted:

  • Multi-step deployments, where steps can include Bicep/CloudFormation template deployments, az or AWS CLI commands, bash scripts, SQL migrations, and so on (sketched below)
  • Moving data between multiple instances or resources (this task carries over from the shared tier).
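
Before any orchestration enters the picture, such a rollout can be thought of as an ordered plan of small, resumable steps. The sketch below is purely illustrative: the step names are hypothetical and the bodies are stubbed out (real ones shell out to az/aws, Helm, bash, or run SQL migrations).

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public sealed record TenantContext(string TenantId, string Region);

public sealed record DeploymentStep(string Name, Func<TenantContext, Task> Run);

public static class TenantRollout
{
    public static IReadOnlyList<DeploymentStep> BuildPlan() => new[]
    {
        new DeploymentStep("deploy-core-bicep", t => Task.CompletedTask),     // blob storage, vector store, OpenAI, database
        new DeploymentStep("helm-install-langfuse", t => Task.CompletedTask), // observability via Helm
        new DeploymentStep("helm-install-kibana", t => Task.CompletedTask),
        new DeploymentStep("configure-ingress", t => Task.CompletedTask),     // k8s ingress + certificates
        new DeploymentStep("configure-pipelines", t => Task.CompletedTask),   // tenant-specific keys and connection strings
        new DeploymentStep("apply-sql-migrations", t => Task.CompletedTask),
    };

    public static async Task RunAsync(TenantContext tenant)
    {
        // In practice each step needs to be checkpointed so a broken rollout can
        // resume from the failed step instead of starting over, which is exactly
        // the gap the orchestration described below fills.
        foreach (var step in BuildPlan())
            await step.Run(tenant);
    }
}
```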

Putting it all together

Based on the tasks mentioned above, the system also needs a way to constantly monitor itself and react to different events, such as resource usage or new tasks coming from other subsystems. It also needs a way to run those tasks in a reliable manner, retrying if something fails.

But "reliability", when it comes to infrastructure deployments, does not simply mean that the system should retry failed steps if something goes wrong.

This is not a regular HTTP-call scenario where one simply retries failed requests until they eventually succeed (with some back-off policy, if defined). With infrastructure, we can't avoid human interaction, especially when deploying to a client-controlled environment. Many things can break the deployment process: we may hit the Azure subscription quota on VM CPUs; some LLMs may no longer be available in a specific region; scripts may deploy the Kubernetes cluster successfully but break when configuring the ingress controller because the client's infrastructure does not allow creating DNS records and has special requirements for certificate management.

So it is not only about making sure there is a retry mechanism, but also about ensuring that whenever an error happens, it is easy for a human to step in, adjust things, change parameters, and re-run the failed steps from where the process left off.

The Temporal framework has these capabilities, and then some.

It stores the whole history of every action performed, workflows can be re-run, and Temporal provides a slick dashboard where developers can see all workflows and their parameters and re-run them from the desired step if needed.
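
One way to model the human-in-the-loop recovery described above is to cap an activity's automatic retries and have the workflow wait for an operator signal carrying corrected parameters. This is a simplified sketch assuming the Temporalio .NET SDK's attribute-based API; the names and the signal-based resume pattern are illustrative, not our production workflow.

```csharp
using System;
using System.Threading.Tasks;
using Temporalio.Activities;
using Temporalio.Common;
using Temporalio.Exceptions;
using Temporalio.Workflows;

public static class DeploymentActivities
{
    // Stand-in for a real deployment step (az/Helm/bash); details elided.
    [Activity]
    public static Task RunStepAsync(string parameters) => Task.CompletedTask;
}

[Workflow]
public class ResumableDeploymentWorkflow
{
    private string? correctedParameters;

    // Hypothetical signal an operator sends after fixing the issue
    // (e.g. choosing a region where the model is actually available).
    [WorkflowSignal]
    public Task ResumeWithAsync(string parameters)
    {
        correctedParameters = parameters;
        return Task.CompletedTask;
    }

    [WorkflowRun]
    public async Task RunAsync(string initialParameters)
    {
        var parameters = initialParameters;
        while (true)
        {
            try
            {
                // Cap automatic retries so a persistent failure surfaces to a human.
                await Workflow.ExecuteActivityAsync(
                    () => DeploymentActivities.RunStepAsync(parameters),
                    new ActivityOptions
                    {
                        StartToCloseTimeout = TimeSpan.FromHours(1),
                        RetryPolicy = new RetryPolicy { MaximumAttempts = 3 },
                    });
                return;
            }
            catch (ActivityFailureException)
            {
                // Pause here until an operator adjusts parameters and signals the workflow.
                correctedParameters = null;
                await Workflow.WaitConditionAsync(() => correctedParameters != null);
                parameters = correctedParameters!;
            }
        }
    }
}
```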

[Infrastructure diagram: Temporal orchestrating changes across different Azure subscriptions]

We found writing infrastructure-management code to be much easier with Temporal: we no longer have to deal with huge PowerShell or bash scripts – we just write Temporal activities in C# and use PowerShell/bash only to execute specific tasks (be it `az deployment group create ...` or `pg_dump ...`).

Activities can deploy resources and query Azure for specific parameters (like fetching a managed identity ID and passing it as a parameter to the next activity). And all those commands and their output are stored in a neat step-by-step event history on the Temporal dashboard!
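
Here is what that can look like end to end: an activity that shells out to the az CLI, and a workflow that chains such activities and feeds one step's output into the next. Again, this is a hedged sketch assuming the Temporalio .NET SDK; the exact az arguments, resource names, and error handling are simplified for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Temporalio.Activities;
using Temporalio.Workflows;

public static class ProvisioningActivities
{
    // Wraps a single az CLI call: the heavy lifting stays in the CLI,
    // the orchestration stays in C#.
    [Activity]
    public static async Task<string> RunAzAsync(string arguments)
    {
        var startInfo = new ProcessStartInfo("az", arguments) { RedirectStandardOutput = true };
        using var process = Process.Start(startInfo)!;
        var output = await process.StandardOutput.ReadToEndAsync();
        await process.WaitForExitAsync();
        if (process.ExitCode != 0)
            throw new InvalidOperationException($"az exited with code {process.ExitCode}");
        return output.Trim();
    }
}

[Workflow]
public class SharedResourceDeploymentWorkflow
{
    [WorkflowRun]
    public async Task RunAsync(string tenantId)
    {
        var options = new ActivityOptions { StartToCloseTimeout = TimeSpan.FromHours(1) };

        // Deploy the Bicep template into the tenant's resource group.
        await Workflow.ExecuteActivityAsync(
            () => ProvisioningActivities.RunAzAsync(
                $"deployment group create --resource-group rg-{tenantId} --template-file main.bicep"),
            options);

        // Query Azure for the managed identity id...
        var identityId = await Workflow.ExecuteActivityAsync(
            () => ProvisioningActivities.RunAzAsync(
                $"identity show --name id-{tenantId} --resource-group rg-{tenantId} --query principalId -o tsv"),
            options);

        // ...and pass it as a parameter to the next activity.
        await Workflow.ExecuteActivityAsync(
            () => ProvisioningActivities.RunAzAsync(
                $"role assignment create --assignee {identityId} --role Contributor --resource-group rg-{tenantId}"),
            options);
    }
}
```

Every step's input and output then shows up in the workflow's event history, which is what makes the dashboard view described above so useful.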

Because Temporal workers run on Kubernetes pods hosted in our environment, we have full control over security: the pods running Temporal workers have dedicated managed identities assigned to them, and those identities have scoped permissions on the target subscriptions.

To put it simply, Temporal makes managing infrastructure for multi-tenant platforms easier than ever before. If you are mired in constantly rolling out changes to a vast number of tenants and making sure all of them have a stable, functional environment, try Temporal and see how it works for you.

We may sound biased toward Temporal, because we are! We’ve tried many other durability/resiliency frameworks, and in our world, there is nothing quite like Temporal’s ease of use, especially in complex, multi-tenant environments where ongoing management and observability are key to sustained success.

Want to learn more? Check out our Airline Booking Demo and see Temporal in action!

Interested in learning more about this topic? Contact our solution experts and set up a time to talk.