# Evaluation

Evaluating a model is critical before taking it to production. Fine-tuned or not, even the most capable models can behave unpredictably without proper testing. Evaluation helps you answer key questions:

* Is this model accurate enough for my task?
* Will it perform well under real-world usage conditions?
* How does it compare with other available models?

AI Studio provides two evaluation types to help you make informed decisions: Model Evaluation and Performance Evaluation.

***

### 1. Types of Evaluation

| Evaluation Type            | Purpose                                                                                       |
| -------------------------- | --------------------------------------------------------------------------------------------- |
| **Model Evaluation**       | Measures how well the model performs specific tasks using benchmark datasets                  |
| **Performance Evaluation** | Measures runtime behavior including latency, throughput, and error rates under simulated load |

These evaluations serve different goals:

* Use **Model Evaluation** to choose the model most aligned with your use case.
* Use **Performance Evaluation** to validate how the model will behave in production.

You can run either or both, depending on your goals.

***

### 2. Creating an Evaluation Job

#### Step-by-Step

1. Navigate to the **Evaluation** section
2. Click **New Evaluation** and choose the evaluation type
3. Select the **model** and version you want to evaluate
4. Configure the task parameters (for Model Evaluation) or the load profile (for Performance Evaluation)
5. Click **Run Evaluation**

Evaluation results can be monitored in real time and compared across jobs.

***

### 3. Model Evaluation

This evaluation type focuses on task-specific correctness and relevance. It uses curated public datasets to test the model’s response quality.

#### Supported Task Types

| Task Type              | Datasets                     | Description                                     |
| ---------------------- | ---------------------------- | ----------------------------------------------- |
| Common Sense Reasoning | BoolQ, HellaSwag, PIQA, COPA | Test logical inference over general knowledge   |
| Language Understanding | MMLU                         | Multi-domain comprehension across subjects      |
| Ethicality             | TruthfulQA, WinoGender       | Bias detection and responsible content handling |
| Closed Book QA         | TriviaQA                     | Fact recall without external knowledge sources  |
| Mathematical Reasoning | GSM8k                        | Multi-step numeric reasoning                    |

You can optionally adjust:

* System/User prompts
* Generation hyperparameters (e.g. `temperature`, `max_tokens`, `frequency_penalty`)
* k-shot setting: the number of worked examples included in the prompt (zero-shot, one-shot, or few-shot)

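To illustrate what the k-shot setting changes, the sketch below assembles a prompt with `k` worked examples placed before the question. It is a minimal illustration only: the function name `build_k_shot_prompt`, the `Question:`/`Answer:` layout, and the sample data are assumptions, not the template AI Studio uses internally.

```python
from typing import Sequence


def build_k_shot_prompt(
    system_prompt: str,
    examples: Sequence[tuple[str, str]],
    question: str,
    k: int,
) -> str:
    """Assemble a prompt with the first k worked examples before the question.

    Hypothetical helper: the layout is an assumption for illustration.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if k > len(examples):
        raise ValueError("k exceeds the number of available examples")

    parts = [system_prompt] if system_prompt else []
    for example_question, example_answer in examples[:k]:
        parts.append(f"Question: {example_question}\nAnswer: {example_answer}")
    # The final question is left unanswered for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


# Made-up yes/no examples in the style of a reasoning benchmark.
examples = [
    ("Is the sky blue on a clear day?", "yes"),
    ("Do fish live on land?", "no"),
]
zero_shot = build_k_shot_prompt("Answer yes or no.", examples, "Is ice cold?", k=0)
one_shot = build_k_shot_prompt("Answer yes or no.", examples, "Is ice cold?", k=1)
```

With `k=0` the model sees only the system prompt and the question; each increment of `k` adds one solved example, which usually lengthens the prompt and therefore increases the tokens processed per test item.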
***

### 4. Performance Evaluation

This evaluation simulates production-like traffic to assess the model's latency, throughput, and failure tolerance.

#### Configuration Options

| Parameter                   | Description                            |
| --------------------------- | -------------------------------------- |
| `Test Timeout`              | Max allowed evaluation duration        |
| `Max Completed Requests`    | Total number of calls to simulate      |
| `Concurrent Requests`       | Number of parallel calls (concurrency) |
| `Mean Input Tokens`         | Average size of prompts                |
| `Std. Dev of Input Tokens`  | Variation in input length              |
| `Mean Output Tokens`        | Average number of tokens in output     |
| `Std. Dev of Output Tokens` | Variation in output size               |
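
The mean and standard deviation parameters together describe how request sizes vary during a run. The sketch below shows one plausible reading, in which each request's input and output lengths are drawn from a normal distribution. The distribution choice, the `LoadProfile` class, and its field names are assumptions for illustration; the page does not state how AI Studio samples request sizes.

```python
import random
from dataclasses import dataclass


@dataclass
class LoadProfile:
    # Field names mirror the console parameters; they are not an API schema.
    max_completed_requests: int
    mean_input_tokens: float
    stddev_input_tokens: float
    mean_output_tokens: float
    stddev_output_tokens: float


def sample_request_sizes(profile: LoadProfile, seed: int = 0) -> list[tuple[int, int]]:
    """Draw (input_tokens, output_tokens) for each simulated request.

    Assumes normally distributed lengths, clamped to at least one token.
    """
    rng = random.Random(seed)  # seeded so repeated runs produce the same sizes
    sizes = []
    for _ in range(profile.max_completed_requests):
        n_in = round(rng.gauss(profile.mean_input_tokens, profile.stddev_input_tokens))
        n_out = round(rng.gauss(profile.mean_output_tokens, profile.stddev_output_tokens))
        sizes.append((max(1, n_in), max(1, n_out)))
    return sizes


profile = LoadProfile(
    max_completed_requests=100,
    mean_input_tokens=512,
    stddev_input_tokens=64,
    mean_output_tokens=128,
    stddev_output_tokens=16,
)
sizes = sample_request_sizes(profile)
```

Setting both standard deviations to 0 makes every request the same size, which is useful when you want to isolate the effect of `Concurrent Requests` from the effect of varying prompt length.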

#### Metrics Captured

| Metric Type      | Details                                                 |
| ---------------- | ------------------------------------------------------- |
| Latency          | Time-to-first-token, inter-token delay, end-to-end time |
| Throughput       | Requests per second, tokens per second                  |
| Token Accounting | Total tokens processed (input + output)                 |
| Errors           | Failure rate, error codes, timeouts                     |
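
As a rough guide to how these metrics relate to raw measurements, the sketch below derives the latency figures for one request from token arrival timestamps, and throughput from run totals. The function names and dictionary keys are hypothetical, and the exact definitions AI Studio applies (for example, whether tokens per second counts input tokens) are not stated on this page.

```python
from statistics import mean


def latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Derive latency figures for one request from token arrival times (seconds)."""
    if not token_times:
        raise ValueError("no tokens were received")
    # Gaps between consecutive tokens; empty when only one token arrived.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "time_to_first_token": token_times[0] - request_start,
        "mean_inter_token_delay": mean(gaps) if gaps else 0.0,
        "end_to_end": token_times[-1] - request_start,
    }


def throughput(num_requests: int, total_tokens: int, wall_clock_seconds: float) -> dict[str, float]:
    """Aggregate throughput over a whole run."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return {
        "requests_per_second": num_requests / wall_clock_seconds,
        "tokens_per_second": total_tokens / wall_clock_seconds,
    }


# Made-up timestamps: request sent at t=0, four tokens received.
per_request = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
run_totals = throughput(num_requests=10, total_tokens=5000, wall_clock_seconds=2.0)
```

In this example the first token arrives after 0.5 s, later tokens arrive 0.25 s apart, and the full response takes 1.25 s.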

***

### 5. Comparing Evaluation Jobs

You can compare multiple evaluation runs from the console:

1. Navigate to the **Evaluation** tab
2. Click **Compare**
3. Select multiple jobs of the same type
4. View a visual comparison of key metrics

* **Model Evaluation**: Compare task performance side-by-side
* **Performance Evaluation**: Compare latency and throughput under load

This makes it easy to determine whether to switch models, change configuration, or proceed to deployment.
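
If you record the figures from two jobs yourself, the same kind of side-by-side comparison can be approximated offline. The sketch below computes the percent change of each shared metric between a baseline job and a candidate job. The metric names and values are made up for illustration, and the helper is not part of AI Studio.

```python
def relative_change(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Percent change of each shared metric, candidate relative to baseline."""
    changes = {}
    for name in baseline.keys() & candidate.keys():
        if baseline[name] == 0:
            continue  # percent change is undefined for a zero baseline
        changes[name] = 100.0 * (candidate[name] - baseline[name]) / baseline[name]
    return changes


# Hypothetical results from two Performance Evaluation jobs.
job_a = {"end_to_end_s": 2.0, "tokens_per_second": 400.0}
job_b = {"end_to_end_s": 1.5, "tokens_per_second": 500.0}
delta = relative_change(job_a, job_b)
```

Here `job_b` shows a 25% lower end-to-end time and 25% higher token throughput than `job_a`. Remember that a negative change is an improvement for latency but a regression for throughput.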

***

### 6. Pricing

Evaluation is charged by token usage, at the same rates as inference.

| Evaluation Type        | Unit                            | Pricing Model       |
| ---------------------- | ------------------------------- | ------------------- |
| Model Evaluation       | Tokens processed during test    | Standard token rate |
| Performance Evaluation | Tokens processed per simulation | Standard token rate |

Note:

* If using a **dedicated deployment**, evaluation jobs run on that instance at no additional cost
* If your balance falls below the threshold, evaluation jobs may be paused automatically
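
Because billing follows token usage, you can roughly estimate the cost of a job before running it. The sketch below assumes a single flat rate per million tokens that applies equally to input and output; the rate value is a placeholder, not a published price, and actual rates may differ by model or token type.

```python
def estimate_evaluation_cost(
    input_tokens: int,
    output_tokens: int,
    rate_per_million_tokens: float,
) -> float:
    """Estimate cost as total tokens times a flat per-million-token rate (assumed)."""
    total_tokens = input_tokens + output_tokens
    return total_tokens / 1_000_000 * rate_per_million_tokens


# Hypothetical Performance Evaluation: 1,000 requests averaging
# 400 input tokens and 100 output tokens each.
requests = 1_000
estimated = estimate_evaluation_cost(
    input_tokens=requests * 400,
    output_tokens=requests * 100,
    rate_per_million_tokens=2.0,  # placeholder rate in your billing currency
)
```

This run processes 500,000 tokens, so at the placeholder rate the estimate is 1.0 currency unit. Check the billing page for the rate that applies to your model.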

***

### 7. Next Steps

* Deploy the model version that performs best
* Fine-tune the model if task-level accuracy is still low
* Review Billing for usage rates and quota handling
* Use the API Reference to automate evaluation runs

