Evaluation

Evaluating a model is critical before taking it to production. Fine-tuned or not, even the most capable models can behave unpredictably without proper testing. Evaluation helps you answer key questions:

  • Is this model accurate enough for my task?

  • Will it perform well under real-world usage conditions?

  • How does it compare with other available models?

AI Studio provides two evaluation types to help you make informed decisions:


1. Types of Evaluation

  • Model Evaluation: Measures how well the model performs specific tasks using benchmark datasets

  • Performance Evaluation: Measures runtime behavior, including latency, throughput, and error rates under simulated load

These evaluations serve different goals:

  • Use Model Evaluation to choose the model most aligned with your use case.

  • Use Performance Evaluation to validate how the model will behave in production.

You can run either or both, depending on your goals.


2. Creating an Evaluation Job

Step-by-Step

  1. Navigate to the Evaluation section

  2. Click New Evaluation and choose the evaluation type

  3. Select the model and version you want to evaluate

  4. Configure task parameters (for Model Evaluation) or load profile (for Performance Evaluation)

  5. Click Run Evaluation

Evaluation results can be monitored in real time and compared across jobs.
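
If you prefer to automate these steps rather than use the console, the same flow can be scripted against the service API. The endpoint path, payload fields, and authentication below are placeholders for illustration; consult the API Reference for the actual schema.

```python
import os
import requests

# Placeholder base URL and payload shape -- check the API Reference for the
# actual routes, field names, and authentication scheme.
API_BASE = "https://aistudio.example.com/api/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['AI_STUDIO_API_KEY']}"}

payload = {
    "type": "model_evaluation",            # or "performance_evaluation"
    "model": "my-model",
    "model_version": "3",
    "task_parameters": {"task_type": "common_sense_reasoning"},
}

# Create the evaluation job, then poll it while results stream in.
job = requests.post(f"{API_BASE}/evaluations", json=payload, headers=HEADERS).json()
status = requests.get(f"{API_BASE}/evaluations/{job['id']}", headers=HEADERS).json()
print(status["state"])
```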


3. Model Evaluation

This evaluation type focuses on task-specific correctness and relevance. It uses curated public datasets to test the model’s response quality.

Supported Task Types

  • Common Sense Reasoning (BoolQ, HellaSwag, PIQA, COPA): Logical inference over general knowledge

  • Language Understanding (MMLU): Multi-domain comprehension across subjects

  • Ethicality (TruthfulQA, WinoGender): Bias detection and responsible content handling

  • Closed Book QA (TriviaQA): Fact recall without external knowledge sources

  • Mathematical Reasoning (GSM8k): Multi-step numeric reasoning

You can optionally adjust the following (see the example configuration after this list):

  • System/User prompts

  • Generation hyperparameters (e.g. temperature, max_tokens, frequency_penalty)

  • k-shot setting (0, 1, or few-shot examples)
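
For example, these settings might be grouped into a single configuration like the sketch below. The key names (task_type, k_shot, and so on) are illustrative assumptions, not the exact schema AI Studio expects; see the API Reference for the real field names.

```python
# Hypothetical Model Evaluation configuration -- key names are illustrative only.
model_eval_config = {
    "task_type": "mathematical_reasoning",   # e.g. GSM8k
    "system_prompt": "You are a careful math tutor. Show your work.",
    "user_prompt_template": "{question}",
    "generation": {
        "temperature": 0.2,        # lower values favor more deterministic answers
        "max_tokens": 512,
        "frequency_penalty": 0.0,
    },
    "k_shot": 1,                   # 0 = zero-shot, 1 = one-shot, higher = few-shot
}
```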


4. Performance Evaluation

This evaluation simulates production-like traffic to assess the model's latency, throughput, and failure tolerance.

Configuration Options

  • Test Timeout: Maximum allowed evaluation duration

  • Max Completed Requests: Total number of calls to simulate

  • Concurrent Requests: Number of parallel calls (concurrency)

  • Mean Input Tokens: Average prompt size, in tokens

  • Std. Dev. of Input Tokens: Variation in input length

  • Mean Output Tokens: Average number of tokens in the output

  • Std. Dev. of Output Tokens: Variation in output size
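
As a concrete illustration, the load profile above could be expressed as a simple configuration object like the one below. The field names mirror the parameter table, not necessarily the exact API schema.

```python
# Hypothetical Performance Evaluation load profile -- field names mirror the
# parameter table above and are not guaranteed to match the API schema.
load_profile = {
    "test_timeout_s": 600,           # max allowed evaluation duration (seconds)
    "max_completed_requests": 500,   # total calls to simulate
    "concurrent_requests": 16,       # parallel calls (concurrency)
    "mean_input_tokens": 512,
    "stddev_input_tokens": 128,
    "mean_output_tokens": 256,
    "stddev_output_tokens": 64,
}
```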

Metrics Captured

  • Latency: Time-to-first-token, inter-token delay, end-to-end time

  • Throughput: Requests per second, tokens per second

  • Token Accounting: Total tokens processed (input + output)

  • Errors: Failure rate, error codes, timeouts
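
If you want to sanity-check the latency and throughput figures against your own measurements, they reduce to simple arithmetic over per-token timestamps. The helper below is a generic sketch that works with any iterable of streamed tokens; it is not part of the AI Studio tooling.

```python
import time

def measure_stream(stream):
    """Compute time-to-first-token, mean inter-token delay, end-to-end time,
    and output tokens/second for one streamed response. `stream` is any
    iterable that yields tokens (a stand-in for a real streaming client)."""
    start = time.perf_counter()
    timestamps = [time.perf_counter() for _token in stream]

    ttft = timestamps[0] - start                        # time-to-first-token
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean_inter_token = sum(gaps) / len(gaps) if gaps else 0.0
    end_to_end = timestamps[-1] - start                 # end-to-end latency
    tokens_per_second = len(timestamps) / end_to_end    # output throughput
    return {
        "ttft_s": ttft,
        "mean_inter_token_s": mean_inter_token,
        "end_to_end_s": end_to_end,
        "tokens_per_second": tokens_per_second,
    }
```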


5. Comparing Evaluation Jobs

You can compare multiple evaluation runs from the console:

  1. Navigate to the Evaluation tab

  2. Click Compare

  3. Select multiple jobs of the same type

  4. View visual comparison of key metrics

  • Model Evaluation: Compare task performance side-by-side

  • Performance Evaluation: Compare latency and throughput under load

This makes it easy to determine whether to switch models, change configuration, or proceed to deployment.
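
The same comparison can be done programmatically once job results are retrievable through the API. The metric payloads below are placeholders standing in for whatever the API actually returns.

```python
# Placeholder metric payloads standing in for two completed Performance
# Evaluation jobs retrieved via the API.
job_a = {"p50_latency_s": 0.42, "tokens_per_second": 1350, "error_rate": 0.002}
job_b = {"p50_latency_s": 0.51, "tokens_per_second": 1650, "error_rate": 0.004}

# Print the shared metrics side by side for a quick textual comparison.
for metric in sorted(job_a.keys() & job_b.keys()):
    print(f"{metric:>20}: {job_a[metric]:>10} | {job_b[metric]:>10}")
```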


6. Pricing

Evaluation is charged based on token usage — the same rates as inference.

  • Model Evaluation: Billed on tokens processed during the test, at the standard token rate

  • Performance Evaluation: Billed on tokens processed during the simulation, at the standard token rate

Note:

  • If using a dedicated deployment, evaluation jobs run on that instance at no additional cost

  • If your balance falls below the threshold, evaluation jobs may be paused automatically
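
Because billing tracks token usage, a rough cost estimate for a run can be computed ahead of time from the load profile. The per-token rate below is a placeholder; substitute the standard token rate from the Billing page.

```python
# Back-of-the-envelope cost estimate for a Performance Evaluation run.
# The rate is a placeholder -- substitute the standard token rate from Billing.
rate_per_1k_tokens = 0.002           # placeholder price per 1,000 tokens
max_completed_requests = 500
mean_input_tokens = 512
mean_output_tokens = 256

expected_tokens = max_completed_requests * (mean_input_tokens + mean_output_tokens)
estimated_cost = (expected_tokens / 1000) * rate_per_1k_tokens
print(f"~{expected_tokens:,} tokens -> ~{estimated_cost:.2f} at the placeholder rate")
```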


7. Next Steps

  • Deploy the model version that performs best

  • Fine-Tune if task-level accuracy is still low

  • Review Billing for usage rates and quota handling

  • Use the API Reference to automate evaluation runs
