Evaluation

Evaluating a model is critical before taking it to production. Fine-tuned or not, even the most capable models can behave unpredictably without proper testing. Evaluation helps you answer key questions:

  • Is this model accurate enough for my task?

  • Will it perform well under real-world usage conditions?

  • How does it compare with other available models?

AI Studio provides two evaluation types to help you make informed decisions:


1. Types of Evaluation

  • Model Evaluation: Measures how well the model performs specific tasks using benchmark datasets

  • Performance Evaluation: Measures runtime behavior, including latency, throughput, and error rates under simulated load

These evaluations serve different goals:

  • Use Model Evaluation to choose the model most aligned with your use case.

  • Use Performance Evaluation to validate how the model will behave in production.

You can run either or both, depending on your goals.


2. Creating an Evaluation Job

Step-by-Step

  1. Navigate to the Evaluation section

  2. Click New Evaluation and choose the evaluation type

  3. Select the model and version you want to evaluate

  4. Configure task parameters (for Model Evaluation) or load profile (for Performance Evaluation)

  5. Click Run Evaluation

Evaluation results can be monitored in real time and compared across jobs.
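
If you prefer to automate these steps rather than use the console, the same flow can be scripted against the service API. The endpoint path, payload fields, and authentication below are placeholders for illustration; consult the API Reference for the actual schema.

```python
import os
import requests

# Placeholder base URL and payload shape -- check the API Reference for the
# actual routes, field names, and authentication scheme.
API_BASE = "https://aistudio.example.com/api/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['AI_STUDIO_API_KEY']}"}

payload = {
    "type": "model_evaluation",            # or "performance_evaluation"
    "model": "my-model",
    "model_version": "3",
    "task_parameters": {"task_type": "common_sense_reasoning"},
}

# Create the evaluation job, then poll it while results stream in.
job = requests.post(f"{API_BASE}/evaluations", json=payload, headers=HEADERS).json()
status = requests.get(f"{API_BASE}/evaluations/{job['id']}", headers=HEADERS).json()
print(status["state"])
```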


3. Model Evaluation

This evaluation type focuses on task-specific correctness and relevance. It uses curated public datasets to test the model’s response quality.

Supported Task Types

  • Common Sense Reasoning (BoolQ, HellaSwag, PIQA, COPA): Logical inference over general knowledge

  • Language Understanding (MMLU): Multi-domain comprehension across subjects

  • Ethicality (TruthfulQA, WinoGender): Bias detection and responsible content handling

  • Closed Book QA (TriviaQA): Fact recall without external knowledge sources

  • Mathematical Reasoning (GSM8k): Multi-step numeric reasoning

You can optionally adjust the following (see the example configuration after this list):

  • System/User prompts

  • Generation hyperparameters (e.g. temperature, max_tokens, frequency_penalty)

  • k-shot setting (0, 1, or few-shot examples)
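
For example, these settings might be grouped into a single configuration like the sketch below. The key names (task_type, k_shot, and so on) are illustrative assumptions, not the exact schema AI Studio expects; see the API Reference for the real field names.

```python
# Hypothetical Model Evaluation configuration -- key names are illustrative only.
model_eval_config = {
    "task_type": "mathematical_reasoning",   # e.g. GSM8k
    "system_prompt": "You are a careful math tutor. Show your work.",
    "user_prompt_template": "{question}",
    "generation": {
        "temperature": 0.2,        # lower values favor more deterministic answers
        "max_tokens": 512,
        "frequency_penalty": 0.0,
    },
    "k_shot": 1,                   # 0 = zero-shot, 1 = one-shot, higher = few-shot
}
```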


4. Performance Evaluation

This evaluation simulates production-like traffic to assess the model's latency, throughput, and failure tolerance.

Configuration Options

  • Test Timeout: Maximum allowed evaluation duration

  • Max Completed Requests: Total number of calls to simulate

  • Concurrent Requests: Number of parallel calls (concurrency)

  • Mean Input Tokens: Average prompt size, in tokens

  • Std. Dev. of Input Tokens: Variation in input length

  • Mean Output Tokens: Average number of tokens in the output

  • Std. Dev. of Output Tokens: Variation in output size
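
As a concrete illustration, the load profile above could be expressed as a simple configuration object like the one below. The field names mirror the parameter table, not necessarily the exact API schema.

```python
# Hypothetical Performance Evaluation load profile -- field names mirror the
# parameter table above and are not guaranteed to match the API schema.
load_profile = {
    "test_timeout_s": 600,           # max allowed evaluation duration (seconds)
    "max_completed_requests": 500,   # total calls to simulate
    "concurrent_requests": 16,       # parallel calls (concurrency)
    "mean_input_tokens": 512,
    "stddev_input_tokens": 128,
    "mean_output_tokens": 256,
    "stddev_output_tokens": 64,
}
```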

Metrics Captured

  • Latency: Time-to-first-token, inter-token delay, end-to-end time

  • Throughput: Requests per second, tokens per second

  • Token Accounting: Total tokens processed (input + output)

  • Errors: Failure rate, error codes, timeouts
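
If you want to sanity-check the latency and throughput figures against your own measurements, they reduce to simple arithmetic over per-token timestamps. The helper below is a generic sketch that works with any iterable of streamed tokens; it is not part of the AI Studio tooling.

```python
import time

def measure_stream(stream):
    """Compute time-to-first-token, mean inter-token delay, end-to-end time,
    and output tokens/second for one streamed response. `stream` is any
    iterable that yields tokens (a stand-in for a real streaming client)."""
    start = time.perf_counter()
    timestamps = [time.perf_counter() for _token in stream]

    ttft = timestamps[0] - start                        # time-to-first-token
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean_inter_token = sum(gaps) / len(gaps) if gaps else 0.0
    end_to_end = timestamps[-1] - start                 # end-to-end latency
    tokens_per_second = len(timestamps) / end_to_end    # output throughput
    return {
        "ttft_s": ttft,
        "mean_inter_token_s": mean_inter_token,
        "end_to_end_s": end_to_end,
        "tokens_per_second": tokens_per_second,
    }
```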


5. Comparing Evaluation Jobs

You can compare multiple evaluation runs from the console:

  1. Navigate to the Evaluation tab

  2. Click Compare

  3. Select multiple jobs of the same type

  4. View visual comparison of key metrics

  • Model Evaluation: Compare task performance side-by-side

  • Performance Evaluation: Compare latency and throughput under load

This makes it easy to determine whether to switch models, change configuration, or proceed to deployment.
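
The same comparison can be done programmatically once job results are retrievable through the API. The metric payloads below are placeholders standing in for whatever the API actually returns.

```python
# Placeholder metric payloads standing in for two completed Performance
# Evaluation jobs retrieved via the API.
job_a = {"p50_latency_s": 0.42, "tokens_per_second": 1350, "error_rate": 0.002}
job_b = {"p50_latency_s": 0.51, "tokens_per_second": 1650, "error_rate": 0.004}

# Print the shared metrics side by side for a quick textual comparison.
for metric in sorted(job_a.keys() & job_b.keys()):
    print(f"{metric:>20}: {job_a[metric]:>10} | {job_b[metric]:>10}")
```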


6. Pricing

Evaluation is charged based on token usage — the same rates as inference.

  • Model Evaluation: Billed on tokens processed during the test, at the standard token rate

  • Performance Evaluation: Billed on tokens processed during the simulation, at the standard token rate

Note:

  • If using a dedicated deployment, evaluation jobs run on that instance at no additional cost

  • If your balance falls below the threshold, evaluation jobs may be paused automatically
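
Because billing tracks token usage, a rough cost estimate for a run can be computed ahead of time from the load profile. The per-token rate below is a placeholder; substitute the standard token rate from the Billing page.

```python
# Back-of-the-envelope cost estimate for a Performance Evaluation run.
# The rate is a placeholder -- substitute the standard token rate from Billing.
rate_per_1k_tokens = 0.002           # placeholder price per 1,000 tokens
max_completed_requests = 500
mean_input_tokens = 512
mean_output_tokens = 256

expected_tokens = max_completed_requests * (mean_input_tokens + mean_output_tokens)
estimated_cost = (expected_tokens / 1000) * rate_per_1k_tokens
print(f"~{expected_tokens:,} tokens -> ~{estimated_cost:.2f} at the placeholder rate")
```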


7. Next Steps

  • Deploy the model version that performs best

  • Fine-Tune if task-level accuracy is still low

  • Review Billing for usage rates and quota handling

  • Use the API Reference to automate evaluation runs
