Deployment

Once a model is finalized—whether from the Catalog or after fine-tuning—it can be deployed as an always-available endpoint for inference.

AI Studio supports two deployment modes:

  • On-Demand: Instant, serverless access. Best for quick experiments and lightweight use cases.

  • Dedicated: Persistent deployment on dedicated infrastructure. Recommended for production-grade workloads.


1. Why Deploy?

Deploying a model creates a stable, callable API that can be integrated into downstream systems and user-facing products. It ensures:

  • Predictable performance

  • Repeatable results

  • Centralized monitoring

  • Easy access via standard APIs

For high-availability, real-time applications, deploying the model is essential.


2. On-Demand vs Dedicated Deployments

| Mode | Description | Use Case |
| --- | --- | --- |
| On-Demand | Serverless, ephemeral deployment managed by the platform | Ad-hoc testing, internal tools |
| Dedicated | Persistent instance with reserved GPU | Production systems, high-throughput APIs |
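
Both modes are reached through the same OpenAI-compatible inference API; in practice only the model identifier changes. The sketch below is illustrative only: the base URL and model identifiers are hypothetical placeholders, so substitute the values shown in your AI Studio console.

```python
# Minimal sketch: calling an on-demand vs. a dedicated deployment through the
# OpenAI-compatible API. Base URL and model IDs below are hypothetical.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-ai-studio.com/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

# On-Demand: reference a Catalog model directly by name.
resp = client.chat.completions.create(
    model="catalog/llama-3-8b-instruct",  # hypothetical Catalog model ID
    messages=[{"role": "user", "content": "Hello!"}],
)

# Dedicated: reference your deployment by its unique deployment name.
resp = client.chat.completions.create(
    model="my-dedicated-deployment",  # the deployment name you chose
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```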


3. Why Use Dedicated Deployment?

Dedicated deployments offer significant benefits over serverless access:

  • Guaranteed Throughput: Your model runs on a dedicated GPU (e.g., NVIDIA H100), delivering consistent latency and handling concurrent requests reliably.

  • Data Security: Inference runs in an isolated environment, reducing risk of data leakage. Suitable for enterprise, healthcare, and financial use cases.

  • Stable Endpoint: Model versioning is locked, making it easy to debug, monitor, and iterate. Ideal for applications with audit or compliance needs.

  • Fine-tuned Model Hosting: Use dedicated deployments to host your own custom fine-tuned models with controlled rollout.


4. Creating a Deployment

To deploy a model from the Model Catalog:

  1. Navigate to the Deployments tab

  2. Click New Deployment

  3. Select the desired model from the drop-down

  4. Provide a unique deployment name

  5. Click Deploy; the system provisions a dedicated instance and exposes an OpenAI-compatible endpoint

To deploy a fine-tuned model:

  1. Navigate to the completed fine-tuning job

  2. Click Deploy

  3. Provide a unique deployment name

  4. Click Deploy; the system provisions a dedicated instance and exposes an OpenAI-compatible endpoint

Deployment completes within minutes, and the endpoint is then ready for use across your application stack.
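
Once the endpoint is live, a quick smoke test confirms it responds and gives a rough latency figure. This is a minimal sketch assuming the OpenAI Python SDK; the base URL and deployment name are hypothetical, so use the values shown in your console.

```python
# Smoke test for a freshly provisioned endpoint: send one request and time it.
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-ai-studio.com/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

start = time.perf_counter()
resp = client.chat.completions.create(
    model="my-dedicated-deployment",  # your deployment name
    messages=[{"role": "user", "content": "Reply with OK."}],
    max_tokens=5,
)
elapsed = time.perf_counter() - start

print(f"Response: {resp.choices[0].message.content!r}")
print(f"Round-trip latency: {elapsed:.2f}s")
```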


5. Managing Deployments

The Deployments tab provides:

  • Status Monitoring: Running, stopped, or failed state

  • Usage Tracking: Requests, token throughput, and basic logs

  • Deployment Controls: Start, stop, or redeploy models

  • Checkpoint Selection: Redeploy older fine-tuning checkpoints if needed

Stopping a deployment releases its GPU and suspends the associated endpoint.
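
Where a management API is available, the same controls can be scripted. The sketch below is purely illustrative: the REST paths and response fields are assumptions standing in for a typical status-check and stop flow, not AI Studio's documented API.

```python
# Hypothetical sketch of programmatic deployment management.
# Endpoint paths and fields below are assumptions, not a documented API.
import requests

BASE = "https://api.example-ai-studio.com/v1/deployments"  # hypothetical
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Check the deployment's state (e.g., running, stopped, or failed).
status = requests.get(f"{BASE}/my-dedicated-deployment", headers=HEADERS).json()
print(status.get("state"))

# Stop the deployment to release its GPU; the endpoint stays suspended
# until the deployment is started again.
requests.post(f"{BASE}/my-dedicated-deployment/stop", headers=HEADERS)
```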


6. Next Steps

  • Run Inference using the deployed model

  • Evaluate latency and runtime performance post-deployment

  • Fine-Tune to improve domain alignment before deployment

  • Refer to the API Reference to integrate your deployment endpoint
