Deployment
Once a model is finalized—whether from the Catalog or after fine-tuning—it can be deployed as an always-available endpoint for inference.
AI Studio supports two deployment modes:
On-Demand: Instant, serverless access. Best for quick experiments and lightweight use cases.
Dedicated: Persistent deployment on dedicated infrastructure. Recommended for production-grade workloads.
1. Why Deploy?
Deploying a model creates a stable, callable API that can be integrated into downstream systems and user-facing products. It ensures:
Predictable performance
Repeatable results
Centralized monitoring
Easy access via standard APIs
For high-availability, real-time applications, deploying the model is essential.
2. On-Demand vs Dedicated Deployments
| Mode | Description | Best For |
| --- | --- | --- |
| On-Demand | Serverless, ephemeral deployment managed by the platform | Ad-hoc testing, internal tools |
| Dedicated | Persistent instance with a reserved GPU | Production systems, high-throughput APIs |
3. Why Use Dedicated Deployment?
Dedicated deployments offer significant benefits over serverless access:
Guaranteed Throughput: Your model runs on a dedicated GPU (e.g., NVIDIA H100), delivering consistent latency and handling concurrent requests reliably.
Data Security: Inference runs in an isolated environment, reducing risk of data leakage. Suitable for enterprise, healthcare, and financial use cases.
Stable Endpoint: Model versioning is locked, making it easy to debug, monitor, and iterate. Ideal for applications with audit or compliance needs.
Fine-tuned Model Hosting: Use dedicated deployments to host your own custom fine-tuned models with controlled rollout.
4. Creating a Deployment
To deploy a model from Model Catalog:
Navigate to the Deployments tab
Click New Deployment
Select the desired model from the drop-down
Provide a unique deployment name
Click Deploy; the system provisions a dedicated instance and exposes an OpenAI-compatible endpoint
To deploy a fine-tuned model:
Navigate to the completed fine-tuning job
Click Deploy
Provide a unique deployment name
Click Deploy to confirm; the system provisions a dedicated instance and exposes an OpenAI-compatible endpoint
Deployment completes within minutes, and the endpoint is then ready for use across your application stack, as in the sketch below.
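Once the endpoint is live, it can be called with any OpenAI-compatible client. The following is a minimal sketch using the OpenAI Python SDK; the base URL, environment variable, and deployment name are illustrative placeholders, not values prescribed by this guide.

```python
import os

from openai import OpenAI

# Point a standard OpenAI client at the deployment's OpenAI-compatible
# endpoint. The base URL and environment variable are placeholders.
client = OpenAI(
    base_url="https://api.example-ai-studio.com/v1",
    api_key=os.environ["AI_STUDIO_API_KEY"],
)

# "model" is the unique deployment name chosen in the steps above.
response = client.chat.completions.create(
    model="my-deployment-name",
    messages=[{"role": "user", "content": "Give me a one-line status summary."}],
)
print(response.choices[0].message.content)
```

Because the endpoint is OpenAI-compatible, existing integrations can usually be repointed by changing only the base URL, API key, and model name.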
5. Managing Deployments
The Deployments tab provides:
Status Monitoring: Running, stopped, or failed states
Usage Tracking: Requests, token throughput, and basic logs
Deployment Controls: Start, stop, or redeploy models
Checkpoint Selection: Redeploy older fine-tuning checkpoints if needed
Stopping a deployment releases its GPU and suspends the associated endpoint, so callers should be prepared for temporary unavailability, as in the sketch below.
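Since a stopped deployment suspends its endpoint, in-flight callers may see connection or status errors until it is started again. The sketch below shows one possible retry-with-backoff pattern using the OpenAI Python SDK's error classes; the endpoint, credentials, deployment name, and backoff policy are assumptions for illustration, not platform guidance.

```python
import time

from openai import APIConnectionError, APIStatusError, OpenAI

# Placeholder endpoint and credential; substitute your own values.
client = OpenAI(
    base_url="https://api.example-ai-studio.com/v1",
    api_key="YOUR_API_KEY",
)

def complete_with_retry(prompt: str, attempts: int = 3) -> str:
    """Call the deployment, backing off briefly when the endpoint is
    unavailable, e.g. while a stopped deployment is being restarted."""
    for attempt in range(attempts):
        try:
            response = client.chat.completions.create(
                model="my-deployment-name",  # placeholder deployment name
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except (APIConnectionError, APIStatusError):
            if attempt == attempts - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(2 ** attempt)  # simple exponential backoff: 1s, 2s, 4s
```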
6. Next Steps
Run Inference using the deployed model
Evaluate latency and runtime performance post-deployment
Fine-Tune to improve domain alignment before deployment
Refer to the API Reference to integrate your deployment endpoint