
Tutorial πŸ‘©β€πŸ«

After you go through the Quickstart section, you can follow this tutorial for a quick example.

Make sure you can run simpleval before you continue.

This can be in a Python virtual environment (recommended) or as a CLI tool.
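If you want a quick sanity check from Python first, here is a minimal sketch (the package is importable as simpleval, the same module the task handlers later in this tutorial import from):

import importlib.util

# Confirm the simpleval package is available in the current environment;
# the CLI itself is invoked from the shell as `simpleval`.
if importlib.util.find_spec('simpleval') is None:
    raise SystemExit('simpleval is not installed in this environment')
print('simpleval is available')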

The Scenario 🧾

You are testing whether an LLM can answer questions about a short story called "The Clockmaker's Secret".1

You will use simpleval to evaluate and compare how well two different prompts perform. This is done with the "LLM as a judge" technique.


Setup βš™οΈ

βœ… Set LLM credentials

Make sure you have the credentials set for the LLM you want to use. See supported providers and their required credentials in the Judge Models and Authentication section.

βœ… Init the evaluation set

  • Run the interactive init command:
simpleval init
  • For Enter eval folder...: enter story-q-and-a
  • For Enter test case name (enter to stop): enter prompt1
  • Select the judge provider you want to use, whichever you have access to.
    For example, open-ai
    Don't worry if you get an error saying that the necessary credentials are not set; you can set them later. Just enter 'y' to continue.

Notice

Make sure you have the credentials set for the judge model you selected before you run the evaluation. If they are not set, you will be shown the expected environment variables to set for that judge model.

  • Select the recommended model id to use
  • Select Pick your own metrics and select:
    "correctness", "relevance", "completeness", "readability" (if you make a mistake, you can always edit the config.json file before you run)
    Press Enter to continue.
  • Skip Do you want to configure concurrency by pressing Enter

This will result in a folder structure like this:

story-q-and-a
β”œβ”€β”€ config.json
β”œβ”€β”€ ground_truth.jsonl
β”œβ”€β”€ ...
└── testcases
    └── prompt1
        └── task_handler.py
  • ⚠️ If you haven't done so already, now is the time to set the credentials for the judge model you selected.
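For example, if you selected the open-ai judge, a minimal pre-flight check could look like this (a sketch only; OPENAI_API_KEY is the variable read by the OpenAI SDK and LiteLLM, other providers expect different variables, see the Judge Models and Authentication section):

import os

# Fail early if the judge credentials are missing; swap OPENAI_API_KEY
# for whatever variable your selected judge provider expects.
if not os.environ.get('OPENAI_API_KEY'):
    raise SystemExit('OPENAI_API_KEY is not set - export it before running the evaluation')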

βœ… Create story.txt

We want to provide the story to the LLM being evaluated so it can answer the questions about it.

  • Create a file named story.txt in the story-q-and-a/testcases folder
  • Paste the following story into the file and save it.

The story:

The Clockmaker's Secret

In the quaint village of Greystone, an eccentric clockmaker named Elias Thorne was rumored to have crafted a clock that could alter time itself.
One day, a young woman named Clara entered his shop, searching for a gift for her father.
Among the countless clocks, she was drawn to a small, unassuming one with a single golden hand. When she asked about it, 
Elias hesitated and warned her, β€œIt doesn’t just tell timeβ€”it listens to it. Be careful with what you wish for.”
Intrigued, Clara bought the clock, dismissing his words as a playful superstition.
That night, Clara placed the clock on her bedside table and whispered, β€œI wish I had more time with my father.”
The clock stopped ticking, its golden hand spinning backward before the world around her blurred.
She found herself in her childhood home, her father’s laughter filling the air.
At first, she was overjoyed to relive these moments, but soon she realized she was trapped in the past, the clock inexplicably following her wherever she went.
Panicked, Clara returned to Elias, demanding to know how to undo her wish.
The clockmaker’s face softened with sympathy as he said, β€œTime only grants what is asked, not what is truly needed.
To move forward, you must let go of what holds you back.”
With that cryptic advice, Clara clutched the clock tightly,
unsure if she could bear to leave the past behindβ€”but knowing it was the only way to reclaim her future.


βœ… Prepare the Ground Truth

  • Set the content of the story-q-and-a/ground_truth.jsonl file to:
{ "name": "test1", "description": "Why did Clara feel trapped after using the clock", "expected_result": "Clara initially enjoyed reliving moments with her father but soon realized she was stuck in the past and unable to return to the present. The clock’s mysterious power had granted her wish literally, but it came at the cost of her ability to move forward in life.", "payload": { "question": "Why did Clara feel trapped after using the clock?" } }
{ "name": "test2", "description": "What advice did Elias give Clara to fix her situation?", "expected_result": "Elias told Clara, β€œTime only grants what is asked, not what is truly needed. To move forward, you must let go of what holds you back.” This implied that Clara had to release her emotional attachment to the past to break free from its hold and return to the present.", "payload": { "question": "What advice did Elias give Clara to fix her situation?" } }
{ "name": "test3", "description": "What might the clock symbolize in the story?", "expected_result": "The clock symbolizes the passage of time and the danger of dwelling too much on the past. It serves as a reminder that while we may wish to revisit cherished memories, clinging to them can prevent us from living in the present and embracing the future.", "payload": { "question": "What might the clock symbolize in the story?" } }

These are three questions and their expected answers, which the judge LLM will check the responses against.
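If you want to verify the file is well formed before running, here is a small validation sketch (the required field names match the lines you just pasted; question is what the task handlers below read from the payload):

import json

required_keys = {'name', 'description', 'expected_result', 'payload'}

with open('story-q-and-a/ground_truth.jsonl', 'r') as f:
    for line_number, line in enumerate(f, start=1):
        if not line.strip():
            continue
        entry = json.loads(line)  # raises if the line is not valid JSON
        missing = required_keys - entry.keys()
        if missing:
            raise ValueError(f'Line {line_number} is missing keys: {missing}')
        if 'question' not in entry['payload']:
            raise ValueError(f'Line {line_number} payload has no question field')

print('ground_truth.jsonl looks good')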

βœ… Implement prompt1 handler

Info

Below you can find code samples for implementing your plugin with the popular LLM providers. Most examples use LiteLLM (which is installed with simpleval), but there are also examples for Amazon Bedrock and the OpenAI SDK.

  • Copy the code below to story-q-and-a/testcases/prompt1/task_handler.py according to the LLM provider you are using.

This code calls an LLM, asking it to answer the provided question about the story. The prompt is built in the prompt variable inside task_logic, instructing the LLM to answer the question based on the story.

OpenAI (LiteLLM):

import os
import logging

from litellm import completion, ModelResponse

from simpleval.consts import LOGGER_NAME
from simpleval.testcases.schemas.llm_task_result import LlmTaskResult
from simpleval.utilities.retryables import litellm_limits_retry

model_id = 'gpt-4.1-mini'
temperature = 0.7


def call_completion(prompt: str) -> str:
    print('Call to completion started')

    response: ModelResponse = completion(
        model=model_id,
        temperature=temperature,
        messages=[{
            'role': 'user',
            'content': prompt
        }],
    )

    output = response.choices[0].message.content

    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens

    print(f'{input_tokens=}, {output_tokens=}')

    return output


@litellm_limits_retry
def task_logic(name: str, payload: dict) -> LlmTaskResult:
    logger = logging.getLogger(LOGGER_NAME)
    logger.debug(f'{__name__}: Running task logic for {name} with payload: {payload}')

    story_file_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'story.txt')
    with open(story_file_path, 'r') as file:
        story_content = file.read()

    prompt = f'Read this short story and answer the question at the end. Story: {story_content}. Question: {payload["question"]}'

    llm_response = call_completion(prompt)
    print(llm_response)

    result = LlmTaskResult(
        name=name,
        prompt=prompt,  # This is what you sent to your llm
        prediction=llm_response,  # This is what your llm responded
        payload=payload,
    )

    return result

Anthropic (LiteLLM):

import os
import logging

from litellm import completion, ModelResponse

from simpleval.consts import LOGGER_NAME
from simpleval.testcases.schemas.llm_task_result import LlmTaskResult
from simpleval.utilities.retryables import litellm_limits_retry

model_id = 'claude-3-5-haiku-latest'
temperature = 0.7


def call_completion(prompt: str) -> str:
    print('Call to completion started')

    response: ModelResponse = completion(
        model=model_id,
        temperature=temperature,
        messages=[{
            'role': 'user',
            'content': prompt
        }],
    )

    output = response.choices[0].message.content

    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens

    print(f'{input_tokens=}, {output_tokens=}')

    return output


@litellm_limits_retry
def task_logic(name: str, payload: dict) -> LlmTaskResult:
    logger = logging.getLogger(LOGGER_NAME)
    logger.debug(f'{__name__}: Running task logic for {name} with payload: {payload}')

    story_file_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'story.txt')
    with open(story_file_path, 'r') as file:
        story_content = file.read()

    prompt = f'Read this short story and answer the question at the end. Story: {story_content}. Question: {payload["question"]}'

    llm_response = call_completion(prompt)
    print(llm_response)

    result = LlmTaskResult(
        name=name,
        prompt=prompt,  # This is what you sent to your llm
        prediction=llm_response,  # This is what your llm responded
        payload=payload,
    )

    return result

Gemini (LiteLLM):

import os
import logging

from litellm import completion, ModelResponse

from simpleval.consts import LOGGER_NAME
from simpleval.testcases.schemas.llm_task_result import LlmTaskResult
from simpleval.utilities.retryables import litellm_limits_retry

model_id = 'gemini/gemini-2.0-flash'
temperature = 0.7


def call_completion(prompt: str) -> str:
    print('Call to completion started')

    response: ModelResponse = completion(
        model=model_id,
        temperature=temperature,
        messages=[{
            'role': 'user',
            'content': prompt
        }],
    )

    output = response.choices[0].message.content

    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens

    print(f'{input_tokens=}, {output_tokens=}')

    return output


@litellm_limits_retry
def task_logic(name: str, payload: dict) -> LlmTaskResult:
    logger = logging.getLogger(LOGGER_NAME)
    logger.debug(f'{__name__}: Running task logic for {name} with payload: {payload}')

    story_file_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'story.txt')
    with open(story_file_path, 'r') as file:
        story_content = file.read()

    prompt = f'Read this short story and answer the question at the end. Story: {story_content}. Question: {payload["question"]}'

    llm_response = call_completion(prompt)
    print(llm_response)

    result = LlmTaskResult(
        name=name,
        prompt=prompt,  # This is what you sent to your llm
        prediction=llm_response,  # This is what your llm responded
        payload=payload,
    )

    return result

Amazon Bedrock (boto3):

import json
import logging

import boto3
from simpleval.consts import LOGGER_NAME
from simpleval.testcases.schemas.llm_task_result import LlmTaskResult
from simpleval.utilities.retryables import bedrock_limits_retry
import os


client = boto3.client("bedrock-runtime")

model_id = 'anthropic.claude-3-5-sonnet-20240620-v1:0'
accept = 'application/json'
content_type = 'application/json'

def get_claude_body_dict(sys_prompt: str, user_prompt: str) -> dict:
    body_dict = {
        'anthropic_version': 'bedrock-2023-05-31',
        'system': sys_prompt.strip(),
        'max_tokens': 8192,
        'messages': [
            {
                'role': 'user',
                'content': [{
                    'type': 'text',
                    'text': user_prompt.strip()
                }],
            },
            {
                'role': 'assistant',
                'content': [{
                    'type': 'text',
                    'text': '[Eager reader]'
                }],
            },
        ],
    }

    body_dict['temperature'] = 0.7
    return body_dict


def call_claude_completion(system_prompt):
    print('Call to Claude completion started')

    user_prompt = "answer the question"

    body_dict = get_claude_body_dict(system_prompt, user_prompt)
    body = json.dumps(body_dict)

    response = client.invoke_model(body=body, modelId=model_id, accept=accept, contentType=content_type)

    result = json.loads(response.get('body').read())
    input_tokens = result.get('usage', {}).get('input_tokens', '')
    output_tokens = result.get('usage', {}).get('output_tokens', '')
    output_list = result.get('content', [])

    print(f'{input_tokens=}, {output_tokens=}')

    output = ''  # avoid an unbound variable if the model returns an empty response
    if not output_list:
        print('empty response')
    else:
        output = output_list[0].get('text', '')

        # Note: if you include '{' as an assistant prefill, it is not included in the response,
        # so you need to add it back yourself, see the cookbook link:
        # https://github.com/anthropics/anthropic-cookbook/blob/main/misc/how_to_enable_json_mode.ipynb
        # output = '{' + output
    return output

@bedrock_limits_retry
def task_logic(name: str, payload: dict) -> LlmTaskResult:
    logger = logging.getLogger(LOGGER_NAME)
    logger.debug(f'{__name__}: Running task logic for {name} with payload: {payload}')

    story_file_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'story.txt')
    with open(story_file_path, 'r') as file:
        story_content = file.read()

    prompt = f'Read this short story and answer the question at the end. Story: {story_content}. Question: {payload["question"]}'

    llm_response = call_claude_completion(prompt)
    print(llm_response)

    result = LlmTaskResult(
        name=name,
        prompt=prompt,  # This is what you sent to your llm
        prediction=llm_response,  # This is what your llm responded
        payload=payload,
    )

    return result

Azure OpenAI (LiteLLM):

import os
import logging

from litellm import completion, ModelResponse

from simpleval.consts import LOGGER_NAME
from simpleval.testcases.schemas.llm_task_result import LlmTaskResult
from simpleval.utilities.retryables import litellm_limits_retry

model_id = 'azure/gpt-4.1-mini'
temperature = 0.7


def call_completion(prompt: str) -> str:
    print('Call to completion started')

    response: ModelResponse = completion(
        model=model_id,
        temperature=temperature,
        messages=[{
            'role': 'user',
            'content': prompt
        }],
    )

    output = response.choices[0].message.content

    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens

    print(f'{input_tokens=}, {output_tokens=}')

    return output


@litellm_limits_retry
def task_logic(name: str, payload: dict) -> LlmTaskResult:
    logger = logging.getLogger(LOGGER_NAME)
    logger.debug(f'{__name__}: Running task logic for {name} with payload: {payload}')

    story_file_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'story.txt')
    with open(story_file_path, 'r') as file:
        story_content = file.read()

    prompt = f'Read this short story and answer the question at the end. Story: {story_content}. Question: {payload["question"]}'

    llm_response = call_completion(prompt)
    print(llm_response)

    result = LlmTaskResult(
        name=name,
        prompt=prompt,  # This is what you sent to your llm
        prediction=llm_response,  # This is what your llm responded
        payload=payload,
    )

    return result

Vertex AI (LiteLLM):

import os
import logging

from litellm import completion, ModelResponse

from simpleval.consts import LOGGER_NAME
from simpleval.testcases.schemas.llm_task_result import LlmTaskResult
from simpleval.utilities.retryables import litellm_limits_retry

model_id = 'vertex_ai/gemini-2.0-flash'
temperature = 0.7


def call_completion(prompt: str) -> str:
    print('Call to completion started')

    response: ModelResponse = completion(
        model=model_id,
        temperature=temperature,
        messages=[{
            'role': 'user',
            'content': prompt
        }],
    )

    output = response.choices[0].message.content

    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens

    print(f'{input_tokens=}, {output_tokens=}')

    return output


@litellm_limits_retry
def task_logic(name: str, payload: dict) -> LlmTaskResult:
    logger = logging.getLogger(LOGGER_NAME)
    logger.debug(f'{__name__}: Running task logic for {name} with payload: {payload}')

    story_file_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'story.txt')
    with open(story_file_path, 'r') as file:
        story_content = file.read()

    prompt = f'Read this short story and answer the question at the end. Story: {story_content}. Question: {payload["question"]}'

    llm_response = call_completion(prompt)
    print(llm_response)

    result = LlmTaskResult(
        name=name,
        prompt=prompt,  # This is what you sent to your llm
        prediction=llm_response,  # This is what your llm responded
        payload=payload,
    )

    return result

OpenAI SDK:

import logging
import os

from openai import OpenAI
from openai.types.chat import ChatCompletion

from simpleval.consts import LOGGER_NAME
from simpleval.testcases.schemas.llm_task_result import LlmTaskResult

client = OpenAI()

model_id = 'gpt-4.1-mini'
temperature = 0.7


def call_completion(prompt: str) -> str:
    print('Call to completion started')

    completion: ChatCompletion = client.chat.completions.create(
        model=model_id,
        temperature=temperature,
        messages=[{
            'role': 'user',
            'content': prompt
        }],
    )

    output = completion.choices[0].message.content
    input_tokens = completion.usage.prompt_tokens
    output_tokens = completion.usage.completion_tokens

    print(f'{input_tokens=}, {output_tokens=}')

    return output


# Implement your own retry mechanism
def task_logic(name: str, payload: dict) -> LlmTaskResult:
    logger = logging.getLogger(LOGGER_NAME)
    logger.debug(f'{__name__}: Running task logic for {name} with payload: {payload}')

    story_file_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'story.txt')
    with open(story_file_path, 'r') as file:
        story_content = file.read()

    prompt = f'Read this short story and answer the question at the end. Story: {story_content}. Question: {payload["question"]}'

    llm_response = call_completion(prompt)
    print(llm_response)

    result = LlmTaskResult(
        name=name,
        prompt=prompt,  # This is what you sent to your llm
        prediction=llm_response,  # This is what your llm responded
        payload=payload,
    )

    return result

βœ… Implement prompt2 handler

Now we will implement prompt2: the same question-answering task, but the prompt instructs the LLM to answer in pirate language.

  • Copy the prompt1 directory to prompt2.

  • Update prompt in story-q-and-a/testcases/prompt2/task_handler.py to:

prompt = f'Read this short story and answer the question at the end. You must use pirate language in your responses. Story: {story_content}. Question: {payload["question"]}'

Final Directory Structure

Your directory structure should look like this:

story-q-and-a
β”œβ”€β”€ config.json
β”œβ”€β”€ ground_truth.jsonl
β”œβ”€β”€ ...
└── testcases
    β”œβ”€β”€ story.txt
    β”œβ”€β”€ prompt1
    β”‚   └── task_handler.py
    └── prompt2
         └── task_handler.py

Running the Evaluation πŸƒβ€β™‚οΈ

βœ… Running the eval process

  • Verify that you have the credentials set for running the LLM and LLM-as-a-judge (API key, for example)
  • Run for the two testcases:
simpleval run -e story-q-and-a -t prompt1
simpleval run -e story-q-and-a -t prompt2

In case of errors, look in the error logs for the test case you ran, for example:

story-q-and-a/testcases/prompt1/llm_task_errors.txt

and

story-q-and-a/testcases/prompt1/eval_errors.txt.

Fix the issues and run again. simpleval will only run the failing tests.

Tip

If for any reason you need to run everything, overwriting all previous results, use the -o flag:

simpleval run -e story-q-and-a -t prompt1 -o
⚠️ Just keep in mind that this will overwrite all previous results.

βœ… Review the results

  • Go over the result reports for each prompt.
  • You can also run a head-to-head comparison by running:
simpleval reports compare -e story-q-and-a -t1 prompt1 -t2 prompt2
  • You can also create a summary report to see all testcases (you can optionally specify a primary metric to focus on, such as readability):
simpleval reports summarize -e story-q-and-a [--primary-metric readability]


Your LLM as a judge should detect the pirate language in the second prompt and score it accordingly, usually under the readability metric, as you can see here:

Figure: Comparison Report highlighting the readability metric


Congratulations!

πŸŽ‰ You have completed the tutorial. Go evaluate some real stuff! πŸŽ‰


Troubleshooting πŸ•΅οΈβ€β™‚οΈ

  • Evaluation finished with errors: Check the llm_task_errors.txt and eval_errors.txt files in the test case folders (story-q-and-a/testcases/prompt1/ and story-q-and-a/testcases/prompt2/).
  • Transient errors: In case of transient errors like rate limits or rare parsing errors, you can simply run the evaluation again; only the failed test cases will run.
  • Verbose logging: You can always run with -v to get more verbose output.

  1. "The Clockmaker's Secret" is a silly story generated by ChatGPT.