Inferencing
Once you've identified a model from the Catalog, you can run inference either directly through the Playground or by integrating via API. Inference allows you to generate outputs using prompts, uploaded files, or other input types depending on the model's modality.
1. How to Run Inference
Option 1: Playground (No Code)
Every model in the Catalog includes a Playground tab. This UI lets you:
Enter prompts (for text models)
Upload files (for speech/image models)
Adjust generation parameters (temperature, top_p, max_tokens, etc.)
View results inline
This is the fastest way to test a model before moving to production.
Option 2: API via Starter Code
Use the Starter Code tab to copy cURL or Python code that calls the Krutrim API directly. The code includes:
Proper API endpoint
Pre-filled model identifier
Default prompt structure
Optional generation parameters
You only need to plug in your API key and input data.
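For reference, here is a minimal Python sketch of what the copied Starter Code typically resembles, assuming the OpenAI-compatible chat completions endpoint used later on this page; the exact endpoint path, model string, and payload fields may differ for other models.

# Minimal sketch of a direct API call, assuming the OpenAI-compatible
# /chat/completions path under the base URL shown in the SDK example below.
import requests

API_KEY = "your_krutrim_key"  # plug in your own API key
URL = "https://cloud.olakrutrim.com/v1/chat/completions"  # assumed endpoint path

payload = {
    "model": "krutrim-1",  # copy the exact model string from the Model Card
    "messages": [
        {"role": "user", "content": "Summarize the benefits of a model catalog."}
    ],
    "max_tokens": 256,  # optional generation parameter
}

response = requests.post(
    URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])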
2. Integration Options
Krutrim inference APIs are OpenAI-compatible, making integration seamless with many open-source tools and SDKs.
OpenAI SDK (Python)
Krutrim supports the openai Python SDK for text models:
from openai import OpenAI
client = OpenAI(
api_key="your_krutrim_key",
base_url="https://cloud.olakrutrim.com/v1"
)
response = client.chat.completions.create(
model="krutrim-1",
messages=[
{"role": "user", "content": "Explain quantum entanglement simply."}
]
)
print(response.choices[0].message.content)
Langchain Integration
Krutrim models can also be used in Langchain through OpenAI-compatible wrappers.
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(
openai_api_key="your_krutrim_key",
openai_api_base="https://cloud.olakrutrim.com/v1",
model_name="krutrim-1"
)
llm.predict("What are some use cases of LLMs in finance?")
This enables integration with Langchain chains, memory, tools, and agents.
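As a sketch of that, the llm object defined above can be dropped into a basic chain; this assumes the classic LLMChain and PromptTemplate interfaces from the same Langchain generation as the import above.

# Hypothetical example: using the Krutrim-backed llm inside a simple Langchain chain.
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["sector"],
    template="List three practical LLM use cases in the {sector} sector.",
)

chain = LLMChain(llm=llm, prompt=prompt)  # llm is the ChatOpenAI instance defined above
print(chain.run(sector="finance"))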
3. Supported Parameters
You can control generation behavior using the following parameters:
temperature: Controls randomness (lower = more deterministic, higher = more creative)
top_p: Controls nucleus sampling probability mass
max_tokens: Maximum number of tokens to generate
frequency_penalty: Penalizes repeated tokens
presence_penalty: Encourages introducing new topics
logit_bias: Biases the probability of specific tokens
stop: Token(s) at which generation should stop
stream: Enables token-by-token streaming
Defaults vary by model and can be overridden via Playground or API.
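As an illustration, the sketch below passes several of these parameters through the OpenAI-compatible SDK call shown earlier; parameter support and sensible values vary by model, so treat the numbers as placeholders.

# Overriding generation parameters on a chat completion call; defaults differ per model.
response = client.chat.completions.create(
    model="krutrim-1",
    messages=[{"role": "user", "content": "Write a haiku about monsoon rain."}],
    temperature=0.7,        # moderate randomness
    top_p=0.9,              # nucleus sampling probability mass
    max_tokens=128,         # cap on generated tokens
    frequency_penalty=0.2,  # discourage repetition
    stop=["\n\n"],          # stop generating at a blank line
    stream=True,            # receive tokens as they are generated
)

# With stream=True the SDK yields chunks instead of a single response object.
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)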
4. Tokenization and Output
Each model uses its own tokenizer, which is applied automatically.
You are charged for both input and output tokens, based on the model's pricing.
Refer to the Billing page for detailed rates and token limits.
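Assuming the non-streaming SDK call from earlier on this page and the standard OpenAI usage fields, you can inspect how many tokens a request consumed:

# Token counts reported on a non-streaming response (OpenAI-compatible usage schema).
usage = response.usage
print("prompt tokens:", usage.prompt_tokens)
print("completion tokens:", usage.completion_tokens)
print("total tokens:", usage.total_tokens)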
5. Troubleshooting Inference
Output is cut off: max_tokens is too low. Increase the max_tokens value.
Output is repetitive: temperature is low or no penalties are set. Raise temperature or apply frequency_penalty.
High latency: the model is large or the prompt is long. Use a smaller model or reduce the prompt size.
Invalid model error: the model name is incorrect. Copy the exact model string from the Model Card.
Authentication failed: the API key is missing or expired. Regenerate your API key in the Krutrim Console.
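Many of these failures surface as exceptions when calling through the openai SDK. The sketch below shows one way to catch them, assuming the exception classes exported by openai>=1.0; the exact error returned for an invalid model may differ.

# Basic error handling around an inference call using the openai SDK's exception types.
from openai import APIError, AuthenticationError, NotFoundError

try:
    response = client.chat.completions.create(
        model="krutrim-1",
        messages=[{"role": "user", "content": "Hello"}],
    )
except AuthenticationError:
    print("Authentication failed: regenerate your API key in the Krutrim Console.")
except NotFoundError:
    print("Invalid model: copy the exact model string from the Model Card.")
except APIError as err:
    print(f"API error: {err}")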
6. Next Steps
Fine-Tune a model for improved domain alignment
Evaluate model quality and latency metrics
Deploy a model as a persistent, production-ready endpoint
For API endpoint details and parameters, visit the API Reference.