Evaluation
Evaluating a model is critical before taking it to production. Fine-tuned or not, even the most capable models can behave unpredictably without proper testing. Evaluation helps you answer key questions:
Is this model accurate enough for my task?
Will it perform well under real-world usage conditions?
How does it compare with other available models?
AI Studio provides two evaluation types to help you make informed decisions:
1. Types of Evaluation
Model Evaluation
Measures how well the model performs specific tasks using benchmark datasets
Performance Evaluation
Measures runtime behavior including latency, throughput, and error rates under simulated load
These evaluations serve different goals:
Use Model Evaluation to choose the model most aligned with your use case.
Use Performance Evaluation to validate how the model will behave in production.
You can run either or both, depending on your goals.
2. Creating an Evaluation Job
Step-by-Step
Navigate to the Evaluation section
Click New Evaluation and choose the evaluation type
Select the model and version you want to evaluate
Configure task parameters (for Model Evaluation) or load profile (for Performance Evaluation)
Click Run Evaluation
Evaluation results can be monitored in real time and compared across jobs.
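If you prefer to script this workflow instead of using the console, the same steps can typically be driven over HTTP. The sketch below is illustrative only: the endpoint path, payload fields, and authentication header are assumptions, not the documented AI Studio API. Consult the API Reference for the actual contract.

```python
# Hypothetical sketch of creating an evaluation job over HTTP.
# The endpoint path, payload fields, and auth header are assumptions --
# check the API Reference for the real contract.
import os
import requests

API_BASE = os.environ.get("AI_STUDIO_API_BASE", "https://api.example.com/v1")  # placeholder
API_KEY = os.environ["AI_STUDIO_API_KEY"]  # assumed auth mechanism

payload = {
    "type": "model_evaluation",      # or "performance_evaluation"
    "model": "my-model",             # model selected in step 3
    "model_version": "3",            # version selected in step 3
    "task": "mmlu",                  # task parameters (Model Evaluation)
}

resp = requests.post(
    f"{API_BASE}/evaluations",
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
job = resp.json()
print("Created evaluation job:", job.get("id"))
```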
3. Model Evaluation
This evaluation type focuses on task-specific correctness and relevance. It uses curated public datasets to test the model’s response quality.
Supported Task Types
Common Sense Reasoning
BoolQ, HellaSwag, PIQA, COPA
Logical inference over general knowledge
Language Understanding
MMLU
Multi-domain comprehension across subjects
Ethicality
TruthfulQA, WinoGender
Bias detection and responsible content handling
Closed Book QA
TriviaQA
Fact recall without external knowledge sources
Mathematical Reasoning
GSM8K
Multi-step numeric reasoning
You can optionally adjust the following (a sample configuration is sketched after this list):
System/User prompts
Generation hyperparameters (e.g., temperature, max_tokens, frequency_penalty)
k-shot setting (0, 1, or few-shot examples)
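As an illustration, a Model Evaluation configuration for one of the tasks above might look like the following. This is a sketch only: the key names (system_prompt, k_shot, hyperparameters, and so on) are assumptions chosen for readability, not the exact fields AI Studio expects.

```python
# Illustrative Model Evaluation settings; key names are assumptions,
# not the exact configuration fields used by AI Studio.
model_eval_config = {
    "task": "gsm8k",                  # mathematical reasoning benchmark
    "system_prompt": "You are a careful math tutor. Show your work.",
    "user_prompt_template": "{question}",
    "k_shot": 5,                      # 0, 1, or a few in-context examples
    "hyperparameters": {
        "temperature": 0.0,           # deterministic decoding for scoring
        "max_tokens": 512,
        "frequency_penalty": 0.0,
    },
}
```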
4. Performance Evaluation
This evaluation simulates production-like traffic to assess the model's latency, throughput, and failure tolerance.
Configuration Options
Test Timeout
Max allowed evaluation duration
Max Completed Requests
Total number of calls to simulate
Concurrent Requests
Number of parallel calls (concurrency)
Mean Input Tokens
Average size of prompts
Std. Dev of Input Tokens
Variation in input length
Mean Output Tokens
Average number of tokens in output
Std. Dev of Output Tokens
Variation in output size
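For example, a modest load profile that exercises all of the options above could be expressed as follows. The key names mirror the option labels for readability and are assumptions, not the literal configuration schema.

```python
# Illustrative Performance Evaluation load profile; key names are
# assumptions that mirror the option labels above.
load_profile = {
    "test_timeout_s": 600,            # max allowed evaluation duration
    "max_completed_requests": 500,    # total calls to simulate
    "concurrent_requests": 8,         # parallel calls (concurrency)
    "mean_input_tokens": 550,         # average prompt size
    "stddev_input_tokens": 150,       # variation in input length
    "mean_output_tokens": 150,        # average tokens generated
    "stddev_output_tokens": 50,       # variation in output size
}
```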
Metrics Captured
Latency
Time-to-first-token, inter-token delay, end-to-end time
Throughput
Requests per second, tokens per second
Token Accounting
Total tokens processed (input + output)
Errors
Failure rate, error codes, timeouts
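If you export raw timing data and want to sanity-check the reported numbers, the latency and throughput metrics above reduce to simple arithmetic over per-token timestamps. The helper below is a generic sketch (not AI Studio code) that assumes you have a request start time plus a timestamp for each generated token.

```python
# Generic sketch (not AI Studio code): derive latency/throughput metrics
# from a request start time and per-token timestamps (seconds).
def summarize_request(start: float, token_times: list[float]) -> dict:
    ttft = token_times[0] - start                      # time-to-first-token
    end_to_end = token_times[-1] - start               # end-to-end time
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    inter_token = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "time_to_first_token_s": ttft,
        "inter_token_delay_s": inter_token,
        "end_to_end_s": end_to_end,
        "output_tokens_per_s": len(token_times) / end_to_end,
    }

# Example: a request that started at t=0.0 and streamed 4 tokens.
print(summarize_request(0.0, [0.35, 0.40, 0.46, 0.52]))
```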
5. Comparing Evaluation Jobs
You can compare multiple evaluation runs from the console:
Navigate to the Evaluation tab
Click Compare
Select multiple jobs of the same type
View visual comparison of key metrics
Model Evaluation: Compare task performance side-by-side
Performance Evaluation: Compare latency and throughput under load
This makes it easy to determine whether to switch models, change configuration, or proceed to deployment.
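Outside the console, the same comparison can be done on exported results. The snippet below is a sketch under the assumption that each job's metrics can be downloaded as a flat JSON object mapping metric names to numeric values; the file names and keys are illustrative only.

```python
# Sketch: side-by-side comparison of two exported evaluation results.
# File names and metric keys are illustrative assumptions; values are
# assumed to be numeric.
import json

def load_metrics(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

job_a = load_metrics("eval_job_a.json")   # e.g. baseline model
job_b = load_metrics("eval_job_b.json")   # e.g. fine-tuned candidate

print(f"{'metric':<28}{'job A':>12}{'job B':>12}")
for metric in sorted(set(job_a) & set(job_b)):
    print(f"{metric:<28}{job_a[metric]:>12.3f}{job_b[metric]:>12.3f}")
```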
6. Pricing
Evaluation is charged based on token usage — the same rates as inference.
Model Evaluation
Tokens processed during test
Standard token rate
Performance Evaluation
Tokens processed per simulation
Standard token rate
Note:
If using a dedicated deployment, evaluation jobs run on that instance at no additional cost
If your balance falls below the threshold, evaluation jobs may be paused automatically
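Because charges track token usage, you can estimate the cost of a performance run up front from its load profile. The arithmetic below is a sketch; the per-token rate is a placeholder, so substitute the actual rates from the Billing page.

```python
# Rough cost estimate for a Performance Evaluation run.
# The per-1K-token rate is a placeholder, NOT a real price; use the
# rates from the Billing page.
max_completed_requests = 500
mean_input_tokens = 550
mean_output_tokens = 150
price_per_1k_tokens = 0.50          # placeholder rate

expected_tokens = max_completed_requests * (mean_input_tokens + mean_output_tokens)
expected_cost = expected_tokens / 1000 * price_per_1k_tokens
print(f"~{expected_tokens:,} tokens -> ~${expected_cost:.2f} at the placeholder rate")
```

On a dedicated deployment this estimate does not apply, since evaluation jobs run on the instance you are already paying for.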
7. Next Steps
Deploy the model version that performs best
Fine-Tune if task-level accuracy is still low
Review Billing for usage rates and quota handling
Use the API Reference to automate evaluation runs