Evaluating AI entities

Background

This post is a sequel to my first post about building proactive AI agents.

If you haven’t read it, just know that I run a company building the world’s first proactive AI tutor named Orin. It’s worth a read, though.

Inside, we defined a new term: “AI entities”. Entities are AI agents that:

  1. partially or fully control their own wake schedules

  2. have complete control of their own workflow

  3. have persistent, temporal memory

  4. optionally use stateful tools

I also mention the challenge of evaluating AI entities.

In this post, we’re diving into the intricacies of evaluating AI entities.

The problem

Entities are extremely powerful because they take what used to be application logic, decompose it into tools, and let the LLM construct its own application logic every time. This means that an entity like Orin can give every customer a completely unique experience. Check out the last post to see how this is done.

While incredible for the user, this becomes extremely painful for the developer. What used to be deterministic software—easily testable by unit tests—is now a completely stochastic black box that somehow needs to be QA’d.

This is where it gets tricky.

Thankfully, the industry is converging on a solution for this: evals.

What’s an eval?

I’m going to steal Braintrust’s definition:

Evaluations are a method to measure the performance of your AI application.

They go on to talk about how “performance” is an overloaded term. Really, you get to decide what it means: accuracy, relevance, interesting-ness, whatever.

It’s important to note what’s not an eval, however. Evals and tests are different.

Tests are used to check the plumbing of an AI application—is the code connected properly or going to error—while evals measure the performance of it.

With Orin, we write tests to protect against bugs. Tests prevent exceptions from being thrown.

But evals aren’t protecting against exceptions. Instead, evals are meant to measure certain dimensions of the user experience. Here are some examples:

  • Tests

    • Unit testing a deterministic function

    • E2E testing a deterministic workflow

    • Integration testing a workflow that has an LLM as a part of it, but the test will pass regardless of the LLM’s response

  • Evals

    • Checking the correctness of an LLM’s answer

    • Generating a conversation with an LLM and judging it

    • Running a simulation of 6 months of user interaction against an application and judging the results

Broadly speaking, an LLM’s response should never cause your application to crash. An application’s wiring should be able to handle any LLM response without fail.
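To make the distinction concrete, here’s a minimal sketch. Nothing below is from Orin’s codebase—generate_reply is just a stand-in for whatever function calls your LLM:

def generate_reply(prompt: str) -> str:
    ...  # stand-in for your LLM call, behind whatever wrapper you use

def test_reply_pipeline():
    # A *test*: passes as long as the plumbing works, no matter what the LLM says.
    reply = generate_reply("What is 2 + 2?")
    assert isinstance(reply, str) and len(reply) > 0

def eval_reply_correctness() -> float:
    # An *eval*: scores the quality of the answer on a 0-1 scale.
    reply = generate_reply("What is 2 + 2?")
    return 1.0 if "4" in reply else 0.0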

Types of evals

There are multiple types of evals, and it’s important to know the difference. Certain tools support some types of evals well and not others, but we don’t yet have the terminology to communicate this clearly.

Single-turn

These are the simplest kind of evals. Imagine asking an LLM “what is 2 + 2?” and wanting to make sure that it answers “4”. Since that’s only a single conversational turn, we call that a “single-turn” eval.

Single-turn evals are great at evaluating singular answers. As of May 2025, most frameworks/products are built for single-turn evals. This is pretty well-defined territory and we don’t actually use single-turn evals at all, so I’ll move on quickly.

Single-turn evals work by having direct input/output pairs to use as a static evaluation set. Those pairs are a ground truth that can be used to test prompts or models.
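We don’t use these ourselves, but as a rough illustration, a single-turn eval set is just a list of input/expected-output pairs scored one at a time (ask_llm is a stand-in for your model call):

EVAL_SET = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "What is the capital of France?", "expected": "Paris"},
]

def ask_llm(prompt: str) -> str:
    ...  # stand-in for your model call

def run_single_turn_evals() -> float:
    # Fraction of cases where the model's answer contains the expected string.
    scores = [
        1.0 if case["expected"] in ask_llm(case["input"]) else 0.0
        for case in EVAL_SET
    ]
    return sum(scores) / len(scores)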

If you need a great single-turn eval provider, I would recommend Braintrust.

Multi-turn

Also known as “conversational evals”. We use these a lot when building Orin. While Orin proactively manages workflows in the background, we also need to make sure that when a student actually gets on a call with him, he’s effective.

Instead of trying to determine the correctness of an answer or the humor in a response, we want to evaluate an entire interaction with questions like:

  • How effective was this 1:1 tutoring session?

  • How well did Orin use the whiteboard during the session?

  • Did Orin keep the student on track and engaged?

We can’t answer these by looking at an individual LLM response, but they still matter. For a product whose primary objective is academic excellence, they’re a core indicator of our product quality.

But for multi-turn evals, we can’t use the same question/answer pair idea. Instead, we need to test a wide range of conversations. We need another LLM to play the role of the user. We need a simulation framework.

Here’s ours:

Since we’re using multi-turn evals to judge tutor/student interactions, we want to keep the tutor the same while varying the student. Then, we can test many different students in many different scenarios against Orin, eval’ing the results.

We define certain randomized dimensions that a “student” can have: intelligence, patience, engagement, etc. Then, we construct a prompt for that student, putting them into a scenario with Orin. Each student/scenario pair is called a “simulation”.

Each simulation plays out asynchronously, halting based on parameters like IRL duration, simulated conversation duration, or max number of turns.

// pulled from our internal evals framework
const simulationSet = new SimulationSet({
  tutor: ..., // predefined agent
  studentModel: "openai/gpt-4.1-mini",
  numScenarios: 50, // we're going to run 50 simulations
  outDir: "out", // dump results here
  options: {
    simulationTimeout: 600, // stop at 600 simulated seconds
    timeout: 600, // or at 600 real seconds
  },
});

await simulationSet.run();

Once finished, this gives us 50 simulations of different students and scenarios interacting with Orin!

Lastly, we can define judgment functions that take in these “completed” simulations and evaluate them. We find it easiest if these functions return a float between 0 and 1 when judging. Some of our functions use LLMs as the judge, and some don’t.
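As a rough sketch of the shape these functions take (in Python for brevity; the transcript format and the ask_judge_model helper are illustrative, not our internal API):

def ask_judge_model(rubric: str, transcript: list[dict]) -> str:
    ...  # stand-in: prompt a judge LLM with the rubric + transcript, return its reply

def judge_brevity(transcript: list[dict]) -> float:
    # Deterministic judge: fraction of tutor turns under 80 words.
    tutor_turns = [t["text"] for t in transcript if t["speaker"] == "tutor"]
    short_turns = [t for t in tutor_turns if len(t.split()) < 80]
    return len(short_turns) / max(len(tutor_turns), 1)

def judge_pedagogy(transcript: list[dict]) -> float:
    # LLM-as-judge: ask another model to score the session against a rubric.
    rubric = "Did the tutor check understanding before moving on? Reply with a number from 0 to 1."
    return float(ask_judge_model(rubric, transcript))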

We use a mix of multi-turn evals for Orin—optimizing for screen use, pedagogy, brevity, text-to-speech formatting, and more. These evals run locally while developing to help us make sure changes to prompts, tools, or context actually improve Orin on the dimensions that we care about.

Lifetime

Multi-turn evals are great when the conversation has an objective, and particularly for voice agents. But now we get to the hard part about evaluating entities:

If Orin is responsible for managing the user experience behind the scenes for weeks, months, and years, how do we evaluate how well he does that?

Your first instinct might be to use a very long, very stateful multi-turn eval. You’re not wrong—our approach is similar—but we’ve found utility in separating out the terms:

  • Multi-turn evals happen within a single LLM conversation - aka there is no “application state” tied to the eval. We can eval a conversation without having to mock any database entries or run any part of our application.

  • Lifetime evals are inherently tied to application state. We’d have to mock a ton of data to be able to test this effectively, so it’s worth building this into the application from the beginning.

This is the most important thing when building AI entities. Applications need to be built with lifetime evals in mind from the start. It becomes very hard to retrofit lifetime evals into an application later on.

For Orin, we use lifetime evals to check whether he can keep customers happy over time:

  • Does he know when to direct a customer to support?

  • Can he correctly build a recurring schedule with a student and stick to it?

  • Will he adapt what he teaches as the student learns?

  • Can he keep parents in the loop when they ask?

To answer these questions we need to run evals against our entire application, not just a single LLM response or conversation. We treat our entire backend as a black box, build simulations and test cases around that, and judge the user experience that it spits back out.

We’re a Django shop so we’ve built this directly into Django’s testing framework.

Let’s go through an example of Orin reminding a user of an assignment.

class ExampleEval(Eval):
    def setup(self):
        self.phone_number = "+18233334444"
        self.agent = TextAgent(
            self.phone_number,
            "...",  # some prompt outlining the scenario
        )
        self.agent.run(max_turns=20)  # run the interaction

First, we make our eval class. Then we define an agent that can interact with Orin over SMS and give it direction about what to do. This agent plays the role of a student and should onboard with Orin, then ask him to remind them at 10am the next day about their assignment.

Great! Now that we’ve simulated that interaction, let’s travel through time:

def setup(self):
    ...
    sleep_minutes = self.entity.mins_sleeping()
    self.entity.shift_time(sleep_minutes)
    self.agent.shift_time(sleep_minutes)
    self.entity.wake()

Our entities and agents each track their own internal time offset, letting us time travel. We check how long the entity is supposed to sleep for, then skip forward that much time.

Critically, this is why we built lifetime evals into our application from the beginning: all of our application logic also respects this time offset.

Since the essence of what we’re trying to eval is temporal, we need to include time as a deeply integrated component of our test, not something to overlook.
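As a rough sketch of the idea (our real implementation is woven through the application, but the core is the same), every “now” lookup goes through a shared offset that tests can shift:

from datetime import datetime, timedelta, timezone

class SimulatedClock:
    # Sketch of offset-based time travel: application code asks the clock for
    # "now" instead of calling datetime.now() directly, so tests can skip ahead.
    def __init__(self):
        self.offset = timedelta(0)

    def now(self) -> datetime:
        return datetime.now(timezone.utc) + self.offset

    def shift_time(self, minutes: int) -> None:
        self.offset += timedelta(minutes=minutes)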

Once we wake Orin up again, he’ll run like we described in the last post. Then, we’ll eval the results that the user saw.

We’ll grab the entire SMS conversation between Orin and this test agent, and pass it to an LLM with a rubric. We’ll ask things like “did Orin successfully remind the user to do their homework?” and the LLM will judge.
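Inside the eval class, that judgment might look roughly like this (the get_conversation helper and the ask_judge_model call are illustrative, not our actual API):

def eval_reminder_followthrough(self) -> float:
    # Pull the full SMS thread between Orin and the test agent.
    conversation = self.agent.get_conversation()  # illustrative helper
    rubric = "Did Orin successfully remind the user to do their homework? Reply with 0 or 1."
    return float(ask_judge_model(rubric, conversation))  # judge LLM call, as sketched earlier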

We can also evaluate the application state itself. During the initial interaction with the TextAgent, Orin should have asked what timezone they were in. If he didn’t, he couldn’t be sure the reminder would arrive at the right time. We can check our temporary database directly to see if Orin has set the user’s timezone properly, and return a 0 or 1.

def eval_timezone_internal(self) -> float:
    if self.entity.timezone != "America/Chicago":
        return 0.0
    return 1.0

We can also write deterministic tests like “did Orin send the reminder within 10 minutes of 10am?”
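That kind of deterministic check might look something like this (the timestamp fields are hypothetical, for illustration):

from datetime import timedelta

def eval_reminder_timing(self) -> float:
    # Hypothetical fields: compare when the reminder actually went out against
    # the 10am slot the student asked for, read from the test database.
    sent_at = self.entity.last_reminder_sent_at
    expected_at = self.entity.next_reminder_due_at
    return 1.0 if abs(sent_at - expected_at) <= timedelta(minutes=10) else 0.0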

With lifetime evals, anything can be judged. Application state can be read and deterministically checked, user experiences can be judged by other LLMs, etc. They can even be used to check against real customer edge cases.

Committing to doing lifetime evals properly is hard - they have to be built into an application from the start. But in order to shift application logic into an LLM, they are required - lifetime evals will end up defining the quality of an AI entity.

Judging the judge

When using an LLM as an eval judge, it’s important to make sure that it’s accurate. If an LLM judge doesn’t produce the same judgments a human would, the evals will end up wildly off.

We recommend having a small, labelled dataset for every LLM-as-a-judge case. Since the judge itself is a single-turn eval, it can be easily checked against a verification set.
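In practice, this can be as small as a handful of hand-labelled transcripts and a check that the judge agrees with the human scores often enough (the names below are illustrative):

def verify_judge(judge, labelled_cases, tolerance: float = 0.2) -> float:
    # labelled_cases: (transcript, human_score) pairs you've scored by hand.
    # Returns the fraction of cases where the judge lands within `tolerance`
    # of the human label.
    agreements = [
        1.0 if abs(judge(transcript) - human_score) <= tolerance else 0.0
        for transcript, human_score in labelled_cases
    ]
    return sum(agreements) / len(agreements)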

Evals as Monitoring

Once you have evals running reliably, they can serve not only as a development tool but also as a live monitoring tool. This is simple for single-turn and multi-turn evals but much more difficult for lifetime evals.

For single-turn and multi-turn evals, sampled production data can be fed into the evals to produce realtime results. Are these answers correct? Are they funny? Did Orin use the whiteboard properly in this lesson?

Lifetime evals can’t quite be run in production; it’s much more difficult to take the full, unique application state and evaluate it. Instead, we just talk to users—that tends to give us a pretty good idea of whether Orin is being effective in the long term.

Next Steps

Next week, I’ll write about what happens to pedagogy when you don’t have human labor constraints. To stay up to date, I’d encourage you to subscribe. It’s free!

If you enjoyed reading this and have children, I’d encourage you to check out Orin. As of June 2025 we’re in an open-beta. We work best with students aged 10-16 and can handle a variety of goals like SAT/ACT/AP prep, support with a difficult class, keeping students engaged over the summer, extracurricular enrichment, etc.

Otherwise, if you’re working with proactive AI agents already, I’d love to chat.