Building proactive AI agents

Background

I recently started building Orin, an AI-powered tutoring company. We wanted to take the long shot of AI in education: actually building a tutor for every student.

There are plenty of “AI tutors” on the market already, but none of them seem to actually be tutors. US consumers spend $25B on tutoring every year, yet no amount of “make flashcards from your lectures” seems to replace that market. We wanted to know why.

So, we talked to parents and found the difference:

Tutoring is a service, not a tool. Families look for remedial support, accelerative practice, or exam prep, and tutors will take those goals seriously, adapt to every family’s needs, and deliver a truly custom long-term tutoring engagement.

Great tutors feel responsible for a student’s success. They take the burden of academics off of a parent’s shoulders and onto their own.

So, we built Orin to do just that.

This post outlines what we’ve changed about the traditional “AI agent” architecture along the way.

Being Proactive and Dynamic

Even today, AI agents still feel like tools. Most AI agent companies in 2025 are building static workflows, triggered by users, that handle unstructured data.

That wasn’t going to work for Orin.

The point of hiring a tutor is so that you don’t have to manage your student’s learning. Orin needs to proactively manage, not reactively answer.

We needed a combination of workflows and triggers that allowed Orin to be proactive and dynamic, not stagnant and static.

Here were our iterations, in order.

❌ User-triggers + static workflows

This is where most AI companies are today. Trigger some workflow on a user action.

This comes with serious drawbacks - if the customer never engages, then Orin is totally stagnant. Every customer gets the same static workflow.

❌ Beat schedule + static workflows

What if we made a beat schedule? For example, Orin needs to plan out lessons before they happen. We can beat every hour and check which lessons are coming up, then have another LLM plan them:
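
Here’s a minimal sketch of that beat; `get_upcoming_lessons`, `llm`, and `save_lesson_plan` are hypothetical stand-ins for your real data layer and inference client:

```python
# Hourly beat: check which lessons are coming up, then have an LLM plan them.
from datetime import timedelta

def get_upcoming_lessons(within: timedelta) -> list[dict]:
    """Hypothetical: query the lessons table for lessons starting soon."""
    return []

def llm(prompt: str) -> str:
    """Hypothetical: one call to whatever model you use for planning."""
    return ""

def save_lesson_plan(lesson: dict, plan: str) -> None:
    """Hypothetical: write the plan back onto the lesson record."""

def hourly_beat() -> None:
    # Run this on a fixed schedule (cron, Celery beat, etc.), once per hour.
    for lesson in get_upcoming_lessons(within=timedelta(hours=24)):
        plan = llm(f"Plan this upcoming lesson for the student:\n{lesson}")
        save_lesson_plan(lesson, plan)
```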

You can add as many hard-coded workflows as you want, like lesson reminders. But it’s a static beat schedule, so every customer gets the same agent logic.

Orin is now somewhat proactive (doesn’t require the user to press a button), but we’re still offering the same workflows to every customer.

❌ Beat schedule + LLM-decided workflows

Let’s keep the beat schedule, but instead let an LLM decide what workflows to run:
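
A sketch of what that might look like; the workflow names and helpers here are illustrative stubs, not our production code:

```python
# The beat still fires on a fixed schedule, but an LLM now picks which
# predefined workflow to run for each customer.
def llm(prompt: str) -> str: return "do_nothing"   # hypothetical inference call
def get_customers() -> list[dict]: return []       # hypothetical query helper

def plan_lessons(customer: dict) -> None: ...
def send_lesson_reminder(customer: dict) -> None: ...
def send_progress_report(customer: dict) -> None: ...

WORKFLOWS = {
    "plan_lessons": plan_lessons,
    "send_lesson_reminder": send_lesson_reminder,
    "send_progress_report": send_progress_report,
    "do_nothing": lambda customer: None,
}

def hourly_beat() -> None:
    for customer in get_customers():
        choice = llm(
            f"Pick exactly one workflow from {sorted(WORKFLOWS)} for this "
            f"customer, given their recent activity:\n{customer}"
        ).strip()
        WORKFLOWS.get(choice, WORKFLOWS["do_nothing"])(customer)
```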

Now we’re cooking with gas!

This solves the stagnation problem and works better than static workflows. With this approach, an LLM is deciding the workflow for each customer.

Unfortunately, while it can handle some customer edge cases, it still relies heavily on exposing the right workflow options to the LLM.

❌ Beat schedule + LLM with tools

Continuing with the beat schedule, let’s give the LLM the tools to create its own workflows.
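
A sketch of that loop; `llm_step`, the tool stubs, and the "done" convention are all hypothetical placeholders:

```python
# The beat now hands the LLM a toolbox and lets it loop, building its own
# workflow, until it decides it's done.
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str
    args: dict = field(default_factory=dict)

def llm_step(history: list[dict], tools: dict) -> Action:
    """Hypothetical: one tool-calling inference step; returns a tool call or 'done'."""
    return Action(name="done")

TOOLS = {
    "get_upcoming_lessons": lambda **kw: [],        # illustrative stubs
    "plan_lesson": lambda **kw: "plan saved",
    "send_sms": lambda **kw: "sent",
    "update_calendar": lambda **kw: "updated",
}

def run_agent(customer_context: str) -> None:
    history = [{"role": "user", "content": customer_context}]
    while True:  # uncapped inference calls per customer: this is where cost adds up
        action = llm_step(history, TOOLS)
        if action.name == "done":
            return
        result = TOOLS[action.name](**action.args)
        history.append({"role": "tool", "name": action.name, "content": str(result)})
```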

This is starting to work! With the right tools, Orin can give every customer a completely tailored experience.

We’ve solved the stagnation issue and we’ve given customers a dynamic experience, but the downside is that this is really expensive. Before, we were doing one inference call for each customer - now we have an uncapped number. The cost per customer is really adding up, and this approach needs an even smarter, more expensive model.

Beating every hour wasn’t fast enough for Orin to catch everything, so we kept shortening the beat interval, burning even more tokens.

This exposes another problem: we don’t actually need Orin to run every hour - most times he’ll wake up and have nothing to do. But we need a frequent enough beat schedule so things don’t fall through the cracks. How can we fix this?

🟡 LLM schedule + LLM with tools

At the end of every workflow, what if we force the LLM to decide how long it should “sleep”? This way, the agent can manage its own schedule.

An every-minute cron wakes up the agents that need to wake—like an alarm clock—and uses orders of magnitude fewer tokens while getting similar results.
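
A sketch, assuming each agent record has a `wake_at` timestamp; `llm` and `run_agent` stand in for the tool-using loop above:

```python
# After each run, the agent picks its own wake time; a cheap every-minute cron
# then wakes only the agents whose alarm has passed.
from datetime import datetime, timedelta, timezone

def llm(prompt: str) -> str: return "90"     # hypothetical inference call
def run_agent(agent: dict) -> None: ...      # the tool-using loop from above

def finish_run(agent: dict) -> None:
    minutes = int(llm(
        "You're done for now. In how many minutes should you wake up to "
        "check on this student again? Reply with a single number."
    ))
    agent["wake_at"] = datetime.now(timezone.utc) + timedelta(minutes=minutes)

def every_minute_cron(agents: list[dict]) -> None:
    now = datetime.now(timezone.utc)
    for agent in agents:
        if agent["wake_at"] <= now:   # the alarm has gone off
            run_agent(agent)
            finish_run(agent)
```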

But, this has one big drawback. What happens when we actually want to trigger something from a user action? For example, whenever a customer sends Orin a text, he should respond - even if he’s sleeping for another 6 hours.

✅ LLM schedule + LLM with tools + Wake events

Let’s define some events that can wake Orin up, like a new text message:
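
A sketch; in practice this would be called from something like an inbound-SMS webhook handler, and the event types and fields are illustrative:

```python
# Certain events wake Orin immediately, regardless of his self-scheduled alarm.
from datetime import datetime, timezone

WAKE_EVENTS = {"sms_received", "email_received", "lesson_rescheduled"}

def handle_event(agent: dict, event: dict) -> None:
    agent.setdefault("pending_events", []).append(event)   # surfaced on next run
    if event["type"] in WAKE_EVENTS:
        agent["wake_at"] = datetime.now(timezone.utc)       # ring the alarm now
```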

This works well. We’ll need to add some logic to avoid running Orin twice at the same time, but this pseudocode is mostly complete.

Now, Orin can decide when he wants to run, providing a completely dynamic experience for every user. He consumes minimal compute, can build his own workflows from his tools, and we can still wake him from user events.

However, we’ve found that this approach can only work with reasoning models, and works best with o3.

I encourage you to try this architecture out and note any emergent behaviors you see :)

Tool Framing

In the last section, I glossed over what tools we’ve provided, and how. Turns out, this is incredibly important. Our key learning here is to frame the agent’s tools so that they mirror the tools that you (yes, you, the reader) use.

If you make the agent’s tools look like normal human tools, then data on the proper use of those tools falls in-sample for most models.

Here’s an example. You could expose this tool to an LLM:
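
(A generic JSON-schema-style tool definition stands in here for however your stack declares tools; the exact fields are illustrative.)

```python
# An "engineer-framed" tool: the name mirrors our database schema.
list_email_records = {
    "name": "list_email_records",
    "description": "Return email records for this account, newest first.",
    "parameters": {
        "type": "object",
        "properties": {
            "unread_only": {"type": "boolean"},
            "limit": {"type": "integer"},
        },
    },
}
```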

To you and me, this seems pretty straightforward; we’ve probably got a table somewhere that stores email records, and this tool can list those records.

But what if, instead, you exposed this tool:
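
(Again a sketch in the same generic format:)

```python
# The same tool, framed the way a person would describe it.
check_email_inbox = {
    "name": "check_email_inbox",
    "description": "Return email records for this account, newest first.",
    "parameters": {
        "type": "object",
        "properties": {
            "unread_only": {"type": "boolean"},
            "limit": {"type": "integer"},
        },
    },
}
```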

We’ve only really changed the name. But to an agent like Orin, this tool will make a lot more sense. The model powering Orin has many examples in-sample for when and why people check their email inbox.

As you build proactive, dynamic agents, this pattern becomes more and more clear. We should be thinking from our agent’s perspective, not from an engineer’s perspective.

For example, Orin has his own SMS inbox that he can read and send SMS messages from, and a contact book to keep track of phone numbers, emails, and more. He also has his own calendar, with all of the standard operations available.

When naming or describing the use of tools, we refer to them as his version of normal human tools: “manage events on your calendar”, “keep track of people in your contact book”, etc. If you don’t do this, models have to figure out how exactly these tools should be used—in addition to using them well in the first place—leading to more surface area for mistakes.

Memory

So, we’ve made Orin proactive and dynamic, and we’ve given him the right tools for the job. Great! We can run him right now and he’ll be able to do some pretty cool things, but he’s missing one key component: memory.

As you can imagine, memory is pretty critical for managing a student’s learning over time.

We want Orin to have an accurate picture of how a student has evolved over time, so that he can reasonably suggest pathways for success, triage issues, and garner expertise.

With memory, we initially evaluated a few approaches:

  1. ❌ We could pack everything into context and slice off anything that didn’t fit. It’s simple, effective for a bit, and not over-engineered. However, this obviously doesn’t scale past a certain number of interactions.

  2. ❌ We could outsource this to products like mem0, Zep, Letta, etc. However, none of these solutions really fit our use case, so we opted to build our own.

    1. The main shortcoming was that most products are built for B2B use cases, where temporal memory is less important. Instead, they tend to prioritize relational memory via graphs. While this is great for remembering user preferences, it’s much worse for temporal data. The most important part of memory for Orin is to understand and remember how a student is growing over time - existing solutions weren’t cut out for this.

    2. We also wanted to control exactly what goes into memory and what doesn’t. Just like tool framing, having precise control over this is really important to the performance of Orin over time as context stacks up.

  3. ✅ We could build our own “temporal” agent memory. We opted for this approach, given the shortcomings of the others.

For Orin’s memory, we needed to be able to store varying degrees of resolution. For example, it’s totally fine if Orin can’t remember exact lesson plans from a year ago. Instead, he should remember big events in the student’s life (what were we studying for?), their progress from a year ago (what have they improved on?), and other similar facts.

We ended up building what we call “decaying-resolution memory”, or DRM.

As Orin wakes and operates, he generates chat message objects via LLM calls. We store these directly in a short-term memory store, and read from this on every LLM call:
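
A sketch, with illustrative field names:

```python
# Short-term memory: every chat message object Orin produces or receives is
# appended to a per-agent store and replayed as context on the next LLM call.
def remember(agent: dict, message: dict) -> None:
    agent.setdefault("short_term_memory", []).append(message)

def build_context(agent: dict, new_messages: list[dict]) -> list[dict]:
    return agent.get("short_term_memory", []) + new_messages
```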

If we stopped here, Orin’s “tokens in memory” would scale linearly with the number of interactions he has.

Let’s improve that.

When Orin goes back to sleep, we enshrine his memories into a long-term memory store. We use an LLM to summarize the short-term memory, add that summary to the long-term memory, and clear the short-term memory. Then, we can use the long-term memory as additional context for every LLM call.
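
Roughly like this, with a hypothetical `llm` helper and illustrative fields:

```python
# Enshrine on sleep: summarize the short-term store, append the summary to
# long-term memory, then clear the short-term store.
from datetime import datetime, timezone

def llm(prompt: str) -> str: return ""   # hypothetical inference call

def enshrine(agent: dict) -> None:
    summary = llm(
        "Summarize these interactions, keeping facts about the student's "
        f"goals and progress:\n{agent.get('short_term_memory', [])}"
    )
    agent.setdefault("long_term_memory", []).append(
        {"at": datetime.now(timezone.utc), "summary": summary}
    )
    agent["short_term_memory"] = []
```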

This approach still scales tokens linearly with interactions, just with a roughly 20x smaller constant. This is better, but ideally we want sublinear scaling.

To do that, we introduce decaying resolution, turning our memory into DRM. Instead of just having a short-term or long-term memory, Orin’s memory computes sliding-window summaries over the ground-truth memories. At any given point in time, Orin’s memory looks something like this:

  1. Direct interaction summaries for today’s interactions. This allows Orin to have a very high resolution memory for everything that happened today.

  2. Daily summaries for what happened in the past two weeks. These summaries are built from the direct interaction summaries each day.

  3. Weekly summaries for the past ten weeks (excluding the past two weeks). These are built from daily summaries.

  4. Monthly summaries for everything more than ten weeks old.

When computing this memory, everything is timestamped and normalized to the user’s timezone.
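
A sketch of the roll-up rule implied by that schedule; the cutoffs mirror the list above, and the actual summarization of each bucket would be another LLM call:

```python
# Decaying-resolution memory: which resolution a memory entry should live at,
# as a function of its age. Entries whose target resolution is coarser than
# their current one get grouped by local day/week/month (in the user's
# timezone) and summarized down into a single coarser entry.
from datetime import timedelta

def target_resolution(age: timedelta) -> str:
    if age <= timedelta(days=1):
        return "interaction"   # today's direct interaction summaries
    if age <= timedelta(weeks=2):
        return "daily"
    if age <= timedelta(weeks=10):
        return "weekly"
    return "monthly"
```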

As you can see, this can work really well to keep a history of what happened. Orin won’t remember the exact lesson timings from last year, but that’s fine. Instead, he’ll remember that a student was struggling with algebra in July and that they were prepping for a big test - that’s the information we care about.

The best part is that DRM scales tokens sublinearly with the number of interactions.

The only drawback of DRM is that you lose information over time. Depending on what types of information your agent cares about, that can be perfectly fine.

You could probably layer existing graph-based memory products on top of this as well if needed. We find that memory like this (with a good summarization prompt) is plenty for our use case.

Separately, if your agent needs certain pieces of information to be stored deterministically, I would urge you to consider how humans remember that information. That data might actually belong in a stateful tool.

I didn’t touch on this point as much in the tool section, but having stateful tools instead of RPC-esque tools can replace many memory needs. Orin doesn’t have to remember every calendar event. He has a calendar for that, just like us.

The downside is that you have to build those tool backends yourself. For example, every instance of Orin (we deploy one per family) has its own SMS inbox, but we have to catch, route, and store Twilio webhooks ourselves to make that work—same with calendars. Someone should build a startup around giving agents their own tools, as opposed to the more popular MCP approach of hooking into someone else’s tools.

Next Steps

It’s safe to say that Orin is a pretty big departure from most implementations of agents, which is why we’ve picked a new term to refer to this architecture: entities.

While the term is nascent, entities are loosely defined as AI agents that:

  1. partially or fully control their own wake schedules

  2. have complete control of their own workflow

  3. have persistent, temporal memory

  4. optionally use stateful tools

When you build entities, they might just surprise you. We’ve seen multiple unprompted behaviors emerge from Orin that are actually amazing additions to the user experience, but I’ll let you discover these for yourself.

Building agents with these patterns isn’t “correct”—it’s simply one of the options. It comes with a myriad of downsides, but has the potential to create beautiful products if done right.

One of the largest downsides of building entities is that, as you shift more application logic into stateful LLM patterns, your user experience becomes more and more difficult to evaluate. Traditional LLM eval frameworks completely fall short for entities. We’ve had to roll our own version for that too, but it’s probably worthy of another blog post.

We think Orin is one of the first AI entities in production, but if you know of others, please let us know. Given that Orin fails our eval checks with anything except the best reasoning models (like OpenAI’s o3), we’re excited to see how much more powerful they can get as intelligence gets cheaper and faster.

If you enjoyed reading this and have children, I’d encourage you to check out Orin. As of May 2025, we’re in an open beta. We work best with students aged 10-16 and can work towards a variety of goals like test prep, class support, extracurricular enrichment, etc.

Otherwise, if you’ve thought about these patterns before and want to build them for production use cases, we’d love to hear from you.