Video: Build Hour: Agent Memory Patterns | Duration: 3492s | Summary: Build Hour: Agent Memory Patterns | Chapters: Introduction to Agents (18.645s), Context Engineering Fundamentals (125.02s), Memory and Context (389.115s), Context Challenges Visualized (536.74s), Agent Tools Demonstration (695.745s), Context Management Solutions (892.685s), Managing Context Efficiently (1000.9s), Prompting Best Practices (1139.24s), Context Management Techniques (1230.015s), Trimming Context Demonstration (1593.175s), Summarization and Memory (1787.445s), Personalized AI Memory (2084.42s), Memory Design Best Practices (2379.77s), Q&A Session Begins (2578.895s), Evaluating Memory Performance (2681.06s), Memory Scope Usage (2834.035s), Memory Management Strategies (2938.885s), Scaling Agent Memory (3076.605s), Conclusion and Resources (3348.905s)
Transcript for "Build Hour: Agent Memory Patterns": Hi, everyone. Welcome back to another build hour. I'm Mikaela on the startup marketing team, and I'm here today with two members of our solution architecture team: Emery, live in the studio, and Brian, joining virtually to help address Q&A throughout the hour. Hi, I'm Emery. I work as a solution architect at OpenAI supporting digital-native customers on building various AI use cases, including long-running AI agents. So today's topic is agent memory patterns, which is a very exciting topic, and Emery's and my first ever build hour together. If you've been following along, we started with how to build agents from scratch using the Responses API, then moved into agent RFT, and today we're exploring agent memory patterns. All of the sessions are up on our YouTube channel, so definitely check them out if you want to catch up or revisit earlier builds. The focus of the build hour is to empower you with the best practices, tools, and AI expertise to scale your company using OpenAI APIs and models. For today's build hour, we'll start with an introduction to context engineering, the foundation for agent memory, and then Emery will walk through several live demos covering memory patterns like reshape and fit, isolate and route, and extract and retrieve. We'll end with best practices, resources, and of course live Q&A. On the right-hand side of the screen, you can drop questions into the Q&A box anytime during the session. Our team is monitoring both in the room and virtually to help answer throughout, and we'll save a few for the end to go through live. Alright, with that, I'll hand it over to Emery to kick things off. Thanks, Mikaela. Hi, everyone. I'll start the first part of the session with a definition of context engineering. This definition is from Andrej Karpathy. I'll start by emphasizing that context engineering is both an art and a science.
It's art because it involves judgment: you have to decide what matters most at a given step of reasoning or action. It's science because there are concrete patterns, methods, and measurable impacts that make context management systematic and repeatable. I'll highlight that modern LLMs don't just perform based on model quality; they perform based on the context you give them. In this slide, I want to talk about the different disciplines that come together to make up context engineering. It's a broader discipline than any single technique like prompt engineering or retrieval. The diagram represents the ecosystem of context optimization layers that together shape what the model sees and understands. You see prompt engineering as a core principle, structured outputs, RAG, and state and history management. Memory is also a crucial part: using persistent or semi-persistent storage like files, databases, or memory tools to upload and retrieve key information. And all of this is contained inside the largest sphere of context engineering. We can also connect these capabilities to different product capabilities. Here is a nice summary slide that covers the core principles. Why it matters: because long-running and tool-heavy agents blow through tokens and degrade quality via poisoning, noise, confusion, and bursting. We have three core strategies, as we discussed in the beginning: reshape and fit to the context window, isolate and route the right amount of context to the right agent, and extract high-quality memories to retrieve at the right time. We also have prompt and tool hygiene as a core principle: keep system prompts clean, clear, and well structured, use a small canonical set of few-shot examples, and minimize overlapping tools to guide tool selection.
And our goal, our North Star, is aiming for the smallest high-signal context that maximizes the likelihood of the desired outcome. This slide is where we transition from why context engineering matters to how to actually do it in practice. I'll frame this as a toolkit of techniques. These are not mutually exclusive; most real-world agent architectures combine multiple strategies depending on the use case and the context budget. The first technique is reshape and fit: we can apply context trimming, compaction, and summarization. The second one is isolate and route: we can offload context and tools to specific sub-agents with a selective handoff. And the last bucket is extract and retrieve: we can talk about memory extraction, state management, and memory retrieval in that last bucket. When we talk about context engineering, it's essential to distinguish between short-term and long-term memory because they solve very different problems. We can group the first two buckets as short-term memory, which we also call in-session techniques. The last bucket is long-term memory, which we call cross-session. That means you can collect different information from multiple sessions and retrieve it in the next session, or other sessions in the future. Short-term memory is all about making the most of the context window during an active interaction, an active conversation. In contrast, long-term memory is about building continuity across sessions. Cool. So we often get excited about how powerful our agents are becoming, how models are getting better and better. They can handle complex tasks, route between tools, and plan multi-step workflows. But underneath all that, there is a core bottleneck, because context is finite.
Every piece of information we add to the prompt (instructions, conversation history, tool outputs) competes for space in a fixed token budget. And this is the why-it-matters slide; I want to make the problem concrete here. I'll frame it around a before-and-after contrast. You see two conversations: what happens without memory on the left and what happens with memory on the right. On the left-hand side, the user started with issues like Wi-Fi, battery, and overheating in an IT troubleshooting agent. After many turns, the agent has forgotten the earlier context. It falls back to re-asking for information that the user already gave. But on the right-hand side, the agent remembers the original issues. Even after many turns, it can pick up the unresolved thread. It references previous actions, like the firmware update and background sync, which makes it feel intelligent and reliable. This stateful behavior is the foundation of a long-running agent. Now I'll switch gears to failure modes. We can group these failure modes into four categories. The first one is context burst: you can imagine it as a sudden token spike in one of your components, due to limited external control or increased tool calls. Context conflict: contradictory instructions or information in your context. Context poisoning: incorrect information enters the context and propagates over the turns; it can come in via summaries, memory objects, or state objects you're injecting into the context. And finally, context noise: you can imagine it as many, many tool definitions coming into your context at the same time, or redundant and overly similar items that create noise in the context. Here's a nice visualization of context burst in tool-heavy workflows.
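The context-burst failure mode above can be made concrete with a small sketch. This is not from the demo's code; it's a hypothetical monitor that flags a turn whose token count spikes relative to the previous one. The token counts are illustrative; in practice you would count with a real tokenizer such as tiktoken.

```python
# Hypothetical sketch: flag a "context burst" as a sudden token spike
# between turns. Thresholds here are illustrative assumptions.

def detect_burst(turn_token_counts, spike_factor=3.0, min_tokens=1000):
    """Return the index of the first turn whose token count jumps by more
    than spike_factor times the previous turn and exceeds min_tokens."""
    for i in range(1, len(turn_token_counts)):
        prev, cur = turn_token_counts[i - 1], turn_token_counts[i]
        if cur >= min_tokens and prev > 0 and cur / prev >= spike_factor:
            return i
    return None

# Example: a tool call dumps a large refund policy into turn 2.
counts = [350, 420, 3400, 3600]
print(detect_burst(counts))  # → 2
```

A monitor like this can feed the kind of trigger thresholds discussed later in the session, so you react before hitting the context window limit.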
You'll see that there will be a spike in one specific turn, where you're injecting a large amount of tool tokens. The next one is context conflict, and we can easily visualize it here. Imagine one of the turns has a very specific tool call. The system instructions say never issue a refund if the warranty status is not active. But in the middle of the turn, a tool result also says a refund is eligible for VIP customers. And at the end of the turn, your agent is responding, "Hey, given your urgent travel, I can issue a full refund." So this is a nice visualization, an example of a context conflict that can come from one of the tool results. And the last one is context poisoning: you can imagine it as a hallucination or something inaccurate mixed into the context at any step, which then propagates across different turns. There are a couple of possible pitfalls here. Lossy summarization edits can cause this. If you're using free-form notes that accumulate over time, they can contradict each other. And finally, older summaries can override newer ones, and you'll basically be introducing hallucinations through the summary logic, injecting that hallucination into the context and propagating it over time. Cool. Now I'll stop sharing my screen and switch to the demo that I prepared for you, and I'll go over some of these challenges to show you how it actually works in a real-world scenario. Okay, let me share my screen. Cool. So here I prepared a demo app for this build hour. It is an IT troubleshooting agent for issues related to both software and hardware. And this is a dual-agent demo that lets you run two agents side by side. The back-end logic sits inside the Next.js app, and I'll be using the OpenAI Agents SDK.
We have two tools connected to that agent: one of them is get orders, and the other one is get policy. So here I can start sending a message, saying hi to both of the agents. And then you'll see that both of the agents are responding to my message. Here I can say, "Hey, my laptop fan is making weird noises while I'm playing games. Is it normal?" You see that the configuration for both of the agents is the same: same model, same reasoning level. I'm sending the same message, and there is no memory configuration yet. You also see the context usage bars at the top, showing the different types of components that are already in the context now. And I can say, "Hey, before that, I want to see my orders. My order number is 12345." So now I'm expecting the model to make a tool call and show me the orders I have. Here you see the order status and the items I have, powered by a specific tool call. As you see, over time it will accumulate different tokens and different types of tokens here. And in the context lifecycle, I'm visualizing what is happening under the hood across multiple turns. Here you see that I have 84 tokens of system instructions. My user input is increasing slowly, but the core component here is the agent output that will be generated by the model. Cool. So this is a typical real-world scenario. Now I also want to showcase how a context burst happens. I can again start with hi, and I can say, "Hey, this time I'm having an overheating issue on my laptop," and the model is responding to my message, to my issue. It's giving me specific instructions and asking me some questions to better understand what's happening. And then I can say, "Hey, thanks. Before that, I want to see the refund policy of my MacBook Pro 2014."
While I'm sending this message, I also want to quickly show you the code and the core concept of how it's working. It's powered by the OpenAI Agents SDK, and here you see the agent definition. It's a customer support assistant. I have specific instructions here that I'm adding, I'm using different models here, and I can also show you the system prompt really quickly. In the instructions, I'm basically saying, "Hey, you're a customer support assistant for devices," and then I'm using very light prompting and instructions for that specific agent. So let's go back to the response. Since I asked about the specific refund policy for my MacBook Pro 2014, it made a tool call called get refund, and it's returning a specific refund policy that I added before. Here you see that between turn two and turn three, there is a specific spike in the context window. In turn two, I had maybe around 300 to 400 tokens, but now I have more than 3,000 tokens, because I am just dumping lots of information into the context. This is a nice example of a context burst. Instead of dumping all of this information into the context as a refund policy, I can be more careful about my tool definitions and tool outputs, to make a decision about what I should inject into the context. Maybe not all of this information is valuable, but as you see, I'll be injecting lots of information into the context, and that's visualized here in this context lifecycle tab. Cool. Now I'll stop sharing my screen, go back to the deck, and continue with the next steps. Okay, nice. So we talked about challenges, what's going on under the hood, and a specific example of a context burst. Now let's talk about the solution.
The solution is managing context efficiently using different techniques such as trimming, compaction, state management, and memories; it's the natural step beyond prompt engineering. Again, this is another visualization of the different components in the context. You see that across the turns the token counts are increasing, and these tokens can come from the system message, the user message, maybe memories you're injecting, or other specific types of tokens added into your context. Here I want to group AI agents in terms of context profiles. We can group them into three categories. The first one is RAG-heavy assistants: you can imagine report or policy QA agents. In this type of agent, the context is mostly dominated by retrieved knowledge and citations. The second one is tool-heavy workflows: context is mostly dominated by frequent tool calls and returned payloads. And the last one is the conversational concierge: think about planning agents and coaching agents. In this case, context is mostly dominated by a growing dialogue history; there will be lots of tokens in the conversation history, like assistant output tokens, that scale with session length. And then, to better understand the solution and the techniques, we can go over what is fixed in our context and what is dynamic and variable. Here you see the different types of components: usually system instructions, tool definitions, and examples (unless you're doing a RAG approach) are mostly static in the context. What is dynamic is tool results, retrieved knowledge, memories, and conversation history. These are nice examples of dynamic and static context and tokens. You have control over the dynamic tokens, and you can apply different techniques to control them efficiently. I would like to start with prompting best practices to avoid context conflict.
You can also find these in our prompting guides and cookbooks. The first rule is being explicit and structured. We suggest you use clear, direct language, specific enough to guide action. You should give room for planning and self-reflection; I think this is becoming more and more important with reasoning models like GPT-5. And you should avoid conflicts: keep the toolset small and non-overlapping, and don't use ambiguous definitions. If a human can't pick the right tool, the model won't either, so be careful with conflicting instructions and tool definitions. For context noise, we talked about many, many tool definitions and many tools attached to the context as an example situation. Again, you should be explicit and structured in your prompts. More tools doesn't always equal better outcomes, so favor targeted tools with clear tool decision boundaries, and return meaningful context from your tools. In the demo, you saw that specific example of a context burst. We suggest you control what goes into the tool output: return high-signal, semantically useful fields, and prefer human-readable identifiers. Nice. So now I'll switch gears to engineering techniques, and I'll start with the first bucket, which is reshape and fit. The first technique here is context trimming. It's a pretty basic technique: it means dropping older turns while keeping the last n turns. Here at turn n, we have limited context. It's getting noisy; there's a lot of information in the context. It can come from a tool, a user message, or different sources, and there's a higher likelihood of losing track because we're getting close to the context limit. But once we trim the older conversations, the older messages, we have fresh context. It has better attention, and you'll see that it will also improve the latency.
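The trimming idea described above can be sketched in a few lines. This is a hedged illustration, not the demo's actual code; it assumes a simple list of role/content messages, and it trims on turn boundaries (a "turn" being a user message plus everything up to the next user message), which matters for the heuristics discussed next.

```python
# Minimal sketch of context trimming: keep the last n turns, drop older ones.
# The message schema here is an assumption for illustration.

def trim_to_last_turns(messages, keep_turns=3):
    # Indices where each turn starts (i.e., a user message).
    turn_starts = [i for i, m in enumerate(messages) if m["role"] == "user"]
    if len(turn_starts) <= keep_turns:
        return messages
    cut = turn_starts[-keep_turns]
    return messages[cut:]

history = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello!"},
    {"role": "user", "content": "check my order 12345"},
    {"role": "tool", "content": "order 12345: shipped"},
    {"role": "assistant", "content": "it's on the way"},
    {"role": "user", "content": "my wifi is broken"},
]
print(len(trim_to_last_turns(history, keep_turns=2)))  # → 4
```

Because the cut always lands on a turn start, the sketch respects the "don't trim mid-turn" rule mentioned later in the session.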
It basically keeps the last n messages and trims the previous, older messages, and these are just some parameters we have control over in context trimming. The second technique is context compaction. It means dropping tool calls or tool call results from the older turns while keeping the rest of the messages. If you have a tool-heavy agent, you can consider this technique. You'll see that your context will be mostly dominated by tool results. It'll be noisy; there will be some context noise and lots of information coming from different tools. After compaction, you'll have fresh context, better attention, and faster processing. And you'll be keeping the tool placeholders intact, even after the context compaction. One question you might have is: how can I decide the heuristics for trimming and compaction? Here I can share a couple of suggestions. First, you can analyze your sessions. You can collect context snapshots from production or from your users. You can collect thumbs-down and disliked contexts to see what's going wrong there. Think about the typical token size of a context and what types of tasks you have in one session. Secondly, do not trim mid-turn and break turn blocks. A turn basically means a user message and all the other messages until the next user message. If you break or don't respect these turns, there will be a higher likelihood of losing track. And finally, don't wait to hit context window limits. Keep track of context allocation. You can set thresholds like 40 or 80 percent; if you're getting close to hitting the context window limits, these thresholds will help you better understand when you should trigger some of these operations. You can control tool outputs, and you can also keep track of token savings.
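The compaction-plus-threshold idea above can be combined in one small sketch. Again, this is an assumption-laden illustration, not the demo's code: the token estimate (character count divided by four) is a rough stand-in for a real tokenizer, and the 80 percent threshold mirrors the heuristic just mentioned.

```python
# Sketch of context compaction with a usage threshold: once estimated usage
# crosses the threshold, replace tool outputs from older turns with short
# placeholders while keeping all other messages intact.

def estimate_tokens(messages):
    return sum(len(m["content"]) // 4 for m in messages)  # rough heuristic

def compact(messages, keep_recent=2, window=100, threshold=0.8):
    if estimate_tokens(messages) < threshold * window:
        return messages  # under budget, nothing to do
    recent = messages[-keep_recent:]
    compacted = []
    for m in messages[:-keep_recent]:
        if m["role"] == "tool":
            # Keep a placeholder so the turn structure stays intact.
            compacted.append({"role": "tool", "content": "[tool output compacted]"})
        else:
            compacted.append(m)
    return compacted + recent

msgs = [
    {"role": "user", "content": "show me the refund policy"},
    {"role": "tool", "content": "policy text " * 40},
    {"role": "assistant", "content": "here is the policy summary"},
    {"role": "user", "content": "thanks, new question"},
]
out = compact(msgs, keep_recent=2, window=150)
print(out[1]["content"])  # → [tool output compacted]
```

The placeholder preserves the shape of the conversation, which is what the talk means by keeping tool placeholders intact after compaction.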
These techniques are also really nice for cost-reduction purposes. You can always keep track of how many tokens you're saving while increasing the overall capability of your agent. The next technique we have is context summarization. It means compressing prior messages into structured summaries that you inject into the context history. Here, at turn n, you see lots of messages and noisy context again. You keep the last n messages and summarize or compress the previous ones, so you'll have fresh context, better attention, and faster processing. And at the end of the day, you'll have a golden summary. This will hold the valuable information, because you'll be compressing all of the available information into a very dense object that you can keep track of, and that will also be useful for you to understand what happened in the conversation. And here's a nice visualization of summarization in the context lifecycle. Let's say you perform the summarization at a specific turn. You'll see that you're compressing all the previous information and injecting it back into the context as a memory object. You see there is a new component, called memory, after the summarization is performed. Nice. Here is a comparison of summarization versus trimming. There are different dimensions you can consider while designing a memory pattern for your agent. You can see that in trimming, you just keep the last n turns and drop the oldest ones. It's a pretty straightforward operation, so it's very fast, with essentially no added latency. But the trade-off is that you might be losing information that was already there. I think this is the main trade-off that we have.
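The summarization flow described above (compress the older turns, inject the result back as a memory item, keep the recent turns verbatim) can be sketched as follows. The summarizer here is a stub; in the real app, a model call with a crafted summary prompt would produce the structured summary.

```python
# Sketch of context summarization. summarize_stub is a stand-in for an LLM
# call; the message schema is an assumption for illustration.

def summarize_stub(messages):
    # Condense older turns into one dense note (a real system would call a model).
    facts = [m["content"] for m in messages if m["role"] == "user"]
    return "SUMMARY: " + " | ".join(facts)

def summarize_context(messages, keep_recent=3, summarizer=summarize_stub):
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Inject the summary back into the history as a memory component.
    memory_item = {"role": "user", "content": summarizer(older)}
    return [memory_item] + recent

history = [
    {"role": "user", "content": "MacBook Pro 2014, macOS Sequoia"},
    {"role": "assistant", "content": "got it"},
    {"role": "user", "content": "hard reset didn't work"},
    {"role": "assistant", "content": "let's check the Wi-Fi icon"},
    {"role": "user", "content": "the Wi-Fi icon is not active"},
]
out = summarize_context(history, keep_recent=2)
print(out[0]["content"].startswith("SUMMARY:"))  # → True
```

Whether the memory item goes in as a user message, a system addition, or a dedicated slot is a design choice; the demo shown later injects it as a user-side memory component.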
It's really best for tool-heavy ops and short workflows. In summarizing, you're keeping track of all the information, so you're not throwing anything away. This can add a little bit of latency and cost, because you'll be doing another summarization call to a model, but you'll be collecting all the information. So think about your agent or use case. If you have multiple tasks in your long-running agent that are independent of each other, you can definitely consider trimming, because the trimmed, thrown-away information is probably not important for the agent in the next turns. But if you're collecting useful information across multiple turns and the tasks are dependent on each other, then you can definitely consider summarization. Nice. Now I'll stop sharing my screen and go back to the demo. I'll show you a couple of examples of these techniques that we just covered. Let me quickly share my screen. Nice. So here, let's go to the configurations page in the demo. For agent B, I want to enable trimming, and I can set max turns as three to trigger the trimming operation, and keep recent turns as three. So here I can start again to test my agent, saying hi. This time, I want to understand the refund policy. Maybe I want to refund the laptop I just bought. I can say, "Hey, I want the refund policy for the MacBook I bought about a month ago," and I want to understand what's happening in terms of refunds, whether I'm eligible or not. The model is now making a tool call, calling the get refund tool. As you see, it's returning that specific information; it shows the return window for returning that specific laptop. I also want to check my order. So I changed my mind, and I'm saying, "Hey, can you also check my order? My order number is 12345."
And I want to see if it's on the way. You see that the model is doing another tool call, get order. Now I want to switch gears. I can say, "Hey, thanks. I'm having an issue with the Internet connection." Up to this turn, you see that I have lots of tool tokens in the context lifecycle, accumulating over the turns. And in that specific turn, it's telling me, "Hey, let's sort it out," and asking me a couple of questions about my device. Then I say, "Hey, I tried to load an Internet page and still see a 404 error." I share a couple of important pieces of information about the situation, and you still see that across the turns it's accumulating lots of tokens; even the agent output is increasing. Okay. Let's say it's still happening on Safari, and this is probably the last message I want to share about the current situation; I'm waiting for more guidance and instructions. And here you see that at the end of turn six, the context is trimmed. If I go back here to visualize what's happening: when I hit turn six, you see that it trimmed the context. It removed all these tool outputs and tool tokens, so now I have a fresh context. And now I can continue talking about the same specific issue, or I can continue to talk about different information. Nice. So let's go back. Here, I also want to show you how summarization works. Compaction works in a similar way to trimming: I can set the compaction trigger as four and keep recent turns as two, and you'll see that when it triggers, it'll compact and remove all those tool outputs, and I'll have a fresh context, similar to the trimming approach. But now I want to be a little bit more advanced here, so I want to enable summarization.
I want to set the summarization trigger as five, and set the number of recent turns to keep. I clicked save, and now I want to see how it summarizes all this information. Here, I'm saying, "Hey, I'm having an Internet connection issue again." This time, I decided to share more information about my situation: where I bought this computer and what the model is. I can say, "Hey, I have a 2014 MacBook Pro 14-inch, and I live in the US, but I bought it from Amsterdam. I had a battery-change service done and just updated the OS version last week; they asked me to update the OS version to macOS Sequoia." As you see, I'm sharing a lot of information, and this information is valuable for an IT troubleshooting agent. And here I can go back and clarify the problem, what I need from the agent, and say, "Hey, I already tried a hard reset after checking the FAQ docs, but it didn't work." This is still a form of memory, because I'm sharing which steps I tried, which worked, and which didn't. I can go back and continue talking with the agent. Now it's reasoning and providing me more detailed guidance and instructions specific to a MacBook. I went back to my computer and saw that the Wi-Fi icon is not active, and I'm thinking maybe it's related to Wi-Fi or maybe it's related to a specific software issue. As you see, across the turns it's getting more complex. The agent needs to reason, and it also needs to keep track of what's in its context and make sure there is no burst, no conflict, no poisoning, and no other failure modes. It's telling me some specific steps, and I can say, "Hey, I tried it already, and I'm wondering if it's a specific software issue."
And then here, I'm waiting for a response from the agent. Again, I shared lots of information here, lots of valuable information. Now the agent knows my device, where I bought this device, which steps I tried, and basically what type of steps I performed before. Cool. Now you see that it's responded to me with very well-structured instructions: "Given what you described, the Wi-Fi icon is not active," and so on. And here I see that the context is summarized. I notice there is an orange component counted as a memory item, and the memory item is basically the summary that we had. Here, between turn four and turn five, you see that I'm condensing some part of the context and injecting it back as a user message, as a memory component. So again, here, memory is basically the summarized context from the previous turns. Now I want to go back to the code, show you the summary prompt, and go over some important topics about this specific prompt. As you see, here's my summary prompt. I'm saying, "Hey, you are a senior customer support assistant for tech devices, setup, and software issues." And then, before you write, I'm saying: be careful with contradictions, make sure you have temporal ordering, and make sure you have hallucination control. I think these are very important things to consider when writing your well-crafted summarization prompt. And then I'm tying the summary to my specific use case. I'm saying, "Hey, write a structured, factual summary," and then think about: product environment, reported issues, what worked and what didn't work, which steps were tried (include identifiers, which is important), key timeline milestones, tool performance insights, current status, and next recommended steps. This is a really nice example of how to craft a summarization prompt.
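The shape of the summary prompt just described can be reconstructed roughly as follows. The exact wording in the demo differs; the section names and phrasing here are illustrative, built only from the elements named in the talk (contradiction handling, temporal ordering, hallucination control, and the structured fields).

```python
# Hedged reconstruction of a summarization prompt's structure; wording is
# illustrative, not the demo's actual prompt.

SUMMARY_PROMPT = """You are a senior customer support assistant for tech devices, setup, and software issues.

Before you write:
- Be careful with contradictions: if two statements conflict, prefer the most recent one.
- Preserve the temporal ordering of events.
- Hallucination control: include only facts stated in the conversation.

Write a structured, factual summary covering:
- Product and environment (include identifiers)
- Reported issues
- What worked and what didn't
- Key timeline milestones
- Tool call insights
- Current status and next recommended steps
"""

def build_summary_request(transcript):
    return SUMMARY_PROMPT + "\nConversation:\n" + transcript

print("contradictions" in build_summary_request("user: hi"))  # → True
```

Tying the field list to your own use case, as the talk does for IT troubleshooting, is the part that makes the resulting summary dense and useful rather than generic.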
And then, if I go back to the context summary, I'm seeing lots of useful information. Now I see: the device is a MacBook Pro, the operating system is macOS Sequoia, it was purchased in Amsterdam, but the location is the USA. You see which steps I tried, even that I tried different steps to connect to the network. You see milestones: I had a battery replacement, which is important information, with steps suggesting a connection issue, and lots of useful details. I think this is really dense information to have about your context. Cool. And finally, I want to show you a form of long-term memory. Let's say I talk with an AI agent, my summary got created, and there's lots of information the agent knows about me. Now I'm resetting my agents, going back, and enabling this cross-session feature. When I enable this, the generated summary from the previous example will be injected into the system prompt when I start a new session. Now I enable that specific feature injection, and I can say hi, sending this to both of the similar agents. The one on the right says, "Hey, good to see you again. Are you still having issues with your MacBook's Internet connection after the macOS Sequoia update?" As you see, the response on the right is super personalized because of this memory component that I injected into the system prompt. It understands what happened previously, it knows my MacBook, and it knows the different previous steps, the Internet issue that I have, and all of that. Then I can say, "Hey, I am still using the same MacBook. How can I update it to macOS Tahoe?" When I send this request, the agent understands which device I have and which version I have, so it provides more personalized details and instructions to me.
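The cross-session injection just demonstrated amounts to prepending the previous session's summary into the new session's system prompt. A minimal sketch, with the caveat that the tag format and precedence sentence are assumptions for illustration, not the demo's actual prompt text:

```python
# Sketch of cross-session memory injection into a system prompt.
# The <memory> delimiters and wording are illustrative assumptions.

BASE_SYSTEM = "You are a customer support assistant for tech devices."

def build_system_prompt(base, memory_summary=None):
    if not memory_summary:
        return base
    return (
        base
        + "\n\n<memory>\n"
        + memory_summary
        + "\n</memory>\n"
        + "Treat memory as potentially stale or incomplete; "
        + "the live conversation takes precedence."
    )

prompt = build_system_prompt(
    BASE_SYSTEM,
    "User has a MacBook Pro 2014 on macOS Sequoia with a Wi-Fi issue.",
)
print("<memory>" in prompt)  # → True
```

Delimiting the memory block explicitly makes it easy to attach the precedence rules and guardrails that are discussed next.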
And finally, I want to show you the specific memory instructions here. When I'm injecting the memory into the system prompt, I'm saying: the memory is not instructions — read it as potentially stale or incomplete. Here, I'm providing precedence rules, because I don't want the model fully focusing on the memory object itself. I'm handling the context here with specific prompts — avoid overweighting the memory — and I'm adding memory guardrails. I'm saying, do not store secrets, and if there is any prompt injection or other type of attack, I want to address that in the memory instructions as well. Nice. So, finally, as you see, this specific response is fully personalized because I already provided this information in the previous summary. Now I'll stop sharing my screen and go back to the deck to continue with the remaining topics. Let's go. Cool. I also want to quickly talk about a couple of other techniques. The isolate-and-route bucket consists of offloading tools and context to sub-agents. It means we are offloading specific context and tools to specific sub-agents, which is a nice form of the isolate-and-route technique. Here you see there will be a new and fresh context, and you'll be minimizing context conflict and poisoning just by routing to specific sub-agents. In the final bucket, I want to talk a little bit about the shape of a memory. When you think about a memory, it can be many different things. The suggestion is to start simple and evolve as needed. You can use consistent structured formats, and you can prioritize what a human agent would naturally remember. And finally, you see the most complex form, which is basically a paragraph of memory. So you can start with a simple one and evolve as needed. And for extraction, you can use a memory tool to extract memories in the live turns.
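A toy sketch of the isolate-and-route idea — each sub-agent gets a fresh, minimal context and only its own tools. The keyword router and agent definitions below are illustrative stand-ins; a real router would typically be an LLM triage call:

```python
# Sketch of isolate-and-route: sub-agents get fresh context and scoped tools.
# The keyword router is a toy stand-in for a model-based triage step.

SUB_AGENTS = {
    "billing": {"system": "You handle billing questions only.", "tools": ["refund"]},
    "network": {"system": "You handle connectivity issues only.", "tools": ["ping"]},
}

def route(user_message: str) -> str:
    """Pick a sub-agent; a real router would use an LLM classification call."""
    if any(w in user_message.lower() for w in ("wi-fi", "internet", "network")):
        return "network"
    return "billing"

def start_subagent_context(user_message: str) -> list[dict]:
    """Fresh context for the chosen sub-agent: no prior-turn clutter."""
    agent = SUB_AGENTS[route(user_message)]
    return [
        {"role": "system", "content": agent["system"]},
        {"role": "user", "content": user_message},
    ]
```

Because each sub-agent starts from a clean context, routing itself becomes the isolation mechanism against context conflict and poisoning.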
You can store memory in JSON as a one- or two-sentence note, you can use type-safe functions, you can use markdown format, and other techniques when you're writing this specific tool for saving the memory. Another approach, in the last bucket, is state management: defining a state object with the goal and other information, and you can even inject the state back into the system prompt across multiple turns at some frequency, or inject it back into a new session. And finally, retrieval. We can perform memory retrieval with a tool, similar to a RAG approach. You can store these memories in a long-term store like a vector DB, and during the live turns you can search, filter, rank, and inject them back into the agent. Nice. So, finally, I want to wrap up and reiterate best practices in agent memory design. The first one is understanding your typical context — you should define what is meaningful for you and for your agent. The second point is deciding when and how to remember and forget. You can promote stable, reusable facts to memory and actively forget temporary, stale, or low-confidence information. You'll see that your memories will be evolving over time, so you can continuously clean, merge, and consolidate memories, and optimize these steps in iterations. And finally, evals are also super important. You can run your own evals to see if there is any improvement with memory on and off, and you can even build memory-specific evals for long-running tasks and long context. Awesome. With that, let's move on to some Q&A. We've had a ton of great questions come in. So why don't we refresh the presentation, and we'll pull up the next slide and get into a few. Nice. Okay. So let me go back to the Q&A session and jump into the questions we have. Let me quickly share it again. Nice.
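The extract-and-retrieve flow — store short notes, then search, filter, rank, and inject — can be sketched as below. The word-overlap scoring is a toy stand-in for a real vector-database similarity search, and the class name is an assumption:

```python
# Sketch of extract-and-retrieve: short memory notes saved by a "memory tool",
# then ranked against the current query at each live turn. Word overlap is a
# toy stand-in for vector-DB similarity search.

from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    notes: list[str] = field(default_factory=list)

    def save(self, note: str) -> None:
        """Write path: what the agent's memory tool would call."""
        self.notes.append(note)

    def search(self, query: str, k: int = 3) -> list[str]:
        """Rank notes by naive word overlap with the query; return top k."""
        q = set(query.lower().split())
        ranked = sorted(self.notes,
                        key=lambda n: len(q & set(n.lower().split())),
                        reverse=True)
        return ranked[:k]
```

In production, `search` would be a vector search with metadata filtering and re-ranking, but the shape of the loop — save during turns, retrieve and inject before responding — stays the same.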
Okay, yeah, let's start with the first question: are there any libraries or packages for context engineering? So, this demo is built using OpenAI's Agents SDK. It gives you really good flexibility to implement your own sessions, and in those sessions you can easily implement trimming, compaction, summarization, and those types of techniques. I see many different libraries evolving really fast to make your life easier for context engineering — as you've seen, there are many techniques, and each technique has different parameters to tune, so the ecosystem of libraries is still evolving. But I can suggest the OpenAI Agents SDK as a starting point for implementing specific context engineering techniques, and you can go from there. Next one: how do you evaluate or measure whether the memory feature is improving performance? This is a really nice question. After this session, you might think, hey, I implemented a specific memory approach, but I don't know if it's good or not, or how well it's performing. We can split this into a couple of parts. The first one is just running your regular evals with memory and without memory. I think this is a really nice way to start checking whether the memory feature works. If you have specific eval metrics like completeness, you can see if there's an increase or decrease, or any statistically significant uplift coming from the memory. And then, if your existing evals aren't capturing that kind of memory-based boost, I suggest you think about memory-based evals — by which I mean evaluating the model on long-running tasks and long context.
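A minimal sketch of that memory on/off A/B eval — `run_agent` here is a stub standing in for real agent calls, and the scoring metric is a deliberately simple assumption:

```python
# Sketch of a memory A/B eval: run the same tasks with memory on and off,
# then compare scores. `run_agent` is a stub, not a real agent call.

def run_agent(task: str, memory_enabled: bool) -> str:
    # Stub: memory-on answers include a personalized detail so the metric
    # has something to measure; a real harness would call the agent here.
    return "MacBook Pro answer" if memory_enabled else "generic answer"

def score(answer: str, expected_fact: str) -> float:
    """Toy metric: does the answer contain the fact memory should supply?"""
    return 1.0 if expected_fact.lower() in answer.lower() else 0.0

def eval_uplift(tasks: list[tuple[str, str]]) -> float:
    """Mean score difference (memory on minus memory off) across tasks."""
    on = sum(score(run_agent(t, True), fact) for t, fact in tasks)
    off = sum(score(run_agent(t, False), fact) for t, fact in tasks)
    return (on - off) / len(tasks)
```

The point is the harness shape — identical tasks, one toggled variable, an uplift number — not the toy metric; in practice you would also check statistical significance over a larger task set.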
If you are not hitting any context thresholds, maybe your agent doesn't need any of these memory improvements at all. So again, you can start with your core evals if you already have them. And then, secondly, you can start creating your own memory-based evals. You can evaluate the quality of the summary, you can evaluate the injection timing, you can evaluate the injection prompt — there are different ways to evaluate it. But of course, for most evals, you might also need to prepare a golden dataset first — think about maybe fifty golden examples of a good summary — or you can try the different heuristics I mentioned before to find the right balance of trimming and compacting. I think we can group this into three buckets: first, running your own evals to see if there's an uplift; second, building memory-specific evals; and third, finding the right heuristics and parameters to apply in the context engineering techniques. Next one: should we use hierarchical context, like entire-project context for the overall task and file-level context for an immediate file edit? The answer is yes, but it mostly depends on the use case. We also have a concept called memory scope. You can think of a global scope: if you have a customer or user of your agent, there is probably some information that you should always remember about that specific user. Maybe this user likes a more friendly tone; maybe this user lives in the US. These are some examples of global memory. But you can also have a scope based on the specific session. Let's say I want to book a trip, and this time I prefer window seats because I want to sleep. That's a nice example of session scope and session memories.
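One way to sketch those two scopes, including the "graduation" of a repeated session preference into global memory — the class, threshold, and names are illustrative assumptions:

```python
# Sketch of memory scopes: session-scoped notes that "graduate" to the global
# scope once the same preference shows up repeatedly. Threshold is illustrative.

from collections import Counter

class ScopedMemory:
    def __init__(self, graduate_after: int = 3):
        self.global_notes: set[str] = set()
        self.session_counts: Counter = Counter()
        self.graduate_after = graduate_after

    def remember_in_session(self, note: str) -> None:
        self.session_counts[note] += 1
        if self.session_counts[note] >= self.graduate_after:
            self.global_notes.add(note)  # promote a stable, reusable fact

    def context_for_new_session(self) -> list[str]:
        """Only global memories survive across sessions."""
        return sorted(self.global_notes)
```

Session scope keeps one-off preferences from polluting the user's permanent profile, while repeated signals still accumulate into durable memory.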
So I think it's good practice to separate these into two buckets: you keep track of session memories within the session scope, and over time you can graduate session memories into global memories, keeping track of what is really important about the specific user. In the travel concierge example, if the user keeps saying, hey, this time I want a window seat — maybe multiple times — you can eventually graduate that memory into global memory, keep it in the agent's mind, and remember it for the next bookings. Nice. Okay: what strategies do you use to keep memory fresh or pruned so the agent doesn't become overloaded with stale data? Yeah, this is another good question. In the real world, you see that memories evolve really fast, so after some time there will be memories that you need to prune and the agent needs to forget. In that case, there are a couple of techniques to apply. The first is keeping a temporal tag: okay, I learned this memory from the user, but I learned it maybe two months ago. If you keep track of these timestamps, or temporal tags, the model will understand what is old and what is new. If I said I like dogs two months ago, and today I say I like cats, the model will understand that my favorite animal now is cats, and it will override the memory with the right instructions. This also falls a little bit under memory consolidation: how to prune stale memories, and how to update and override new memories into existing ones. So temporal tags are one technique you can apply. The other one is using a weight decay, or basically a window function, so you focus more on the recent memories and downgrade the oldest ones.
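A small sketch of the timestamp-plus-decay idea: tag each memory with when it was saved, then score it with an exponential decay so newer facts outrank (and can effectively override) older ones. The half-life value is an illustrative assumption:

```python
# Sketch of recency handling: timestamp each memory, score with exponential
# decay so newer facts outrank older ones. Half-life is an assumption.

def decay_weight(saved_at: float, now: float, half_life_days: float = 30.0) -> float:
    """Weight halves every `half_life_days`; timestamps are unix seconds."""
    age_days = (now - saved_at) / 86400
    return 0.5 ** (age_days / half_life_days)

def top_memories(notes: list[tuple[str, float]], now: float, k: int = 3) -> list[str]:
    """notes = [(text, saved_at)]; return the k highest recency-weighted notes."""
    ranked = sorted(notes, key=lambda n: decay_weight(n[1], now), reverse=True)
    return [text for text, _ in ranked[:k]]
```

So "I like dogs" from two months ago loses to yesterday's "I like cats" without ever being deleted; a hard cutoff (window function) is the same idea with a weight of zero beyond the window.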
It really depends on the nature of your use case. If you think that what I said a year ago is not important for your agent, you can definitely prune the old memories and apply a recency weighting across them. But if you think all of these memories are equally important for your agent, then you can consider memory consolidation and memory override with temporal tags. So those are two different techniques for managing overloaded and stale memories. Nice. Okay: how do you manage scaling agent memory systems when you have many users with individual and shared memory pools? Yeah, this is another good example from the real world. You'll see memories evolving over time, and you'll be collecting tons of memories from your users. There are different ways to scale this. I think the first decision criterion is whether you're performing a retrieval- or search-based long-term memory approach, or just summarizing the context. If it's the second one, that means you're storing all of this information and persisting it to disk, so you can think about scaling methods around data management: how to manage a large number of memory notes in text format. Or you can think about scaling the first approach, which means thinking about how to scale a search and retrieval system. You might be storing all of this information in a vector database, and then you can scale the storage, the vectors, the filtering, the ranking system, and all of that. So the first bucket is mostly about this long-term memory.
We talked about memory as a tool — extracting memories with a tool and returning them during the live turns — and that's probably the situation where you'll hit this question about scaling for many users. In this case, you can think about scaling techniques for vector databases: you can use sharding, you can optimize your embedding model if you're using a customized one, and you can optimize the retrieval process, similar to a RAG approach. Again, the first bucket is mostly about scaling a retrieval system; the second one is mostly about basic data storage — how to store the specific data and manage tons of information and sentences. To wrap up, we can put it into two buckets: one is scaling and optimizing a retrieval system, and the second is making storage and persistence on disk more efficient. This is also a common question I hear from my customers. I think you can follow a pilot approach: turn on these new memory techniques for a subgroup of your users and watch how things evolve over time. Maybe you'll see that the memories your users share are pretty limited. Think about the travel concierge agent: probably I'm just sharing memories about my seat preference; maybe, if I want to book a hotel, I like higher floors, or I like a specific menu or breakfast. That's a more limited set of memory possibilities for that type of agent. But if you're building a life coach agent, there are tons of memories you need to remember about me and my life, and you'll see that those memory pools evolve really fast. So the third point is: try to understand the evolution of memory and the range of possible memories in your AI agent.
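The user-level sharding mentioned above can be as simple as a stable hash that maps each user to one of N shards — shard count and naming here are illustrative:

```python
# Sketch of user-level sharding for per-user memory stores: a stable hash
# maps each user to one of N shards. Shard count and naming are illustrative.

import hashlib

def shard_for_user(user_id: str, n_shards: int = 16) -> int:
    """Stable shard index: the same user always lands on the same shard."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

def shard_name(user_id: str, n_shards: int = 16) -> str:
    """Illustrative naming scheme for the backing store of each shard."""
    return f"memory-shard-{shard_for_user(user_id, n_shards):02d}"
```

Using a cryptographic hash (rather than Python's built-in `hash`, which is salted per process) keeps the mapping stable across restarts and machines, so a user's memories always route to the same store.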
So we have two examples here: travel concierge memories and life coach memories. As you see in the second one, you'll be collecting tons of information that is valuable — about my life, my dreams, my goals, what I was thinking a month ago or a year ago. So the second one is a much more advanced, complex, and sophisticated memory pool that requires lots of scaling, for sure. Okay, so that was the end of the Q&A session. Yeah. Okay. And then we can switch to resources. Alright, this has been awesome. To wrap things up, we've linked a few great resources here, including the context engineering cookbook, the context summarization cookbook that was referenced, and our Agents Python SDK. I know we've gotten a lot of questions on whether this is available on GitHub — you can explore all of these links on the right, and the full Build Hour repo is available on GitHub. And good news: we're likely going to squeeze one or two more of these in before the end of the year, so keep an eye on our Build Hours page linked here. A big thank you all for tuning in, and a big thanks to Emery, who did an amazing job with this session. Yeah, thanks, everyone. We hope you enjoyed this Build Hour on agent memory patterns. I know we covered lots of different techniques and lots of information about memory: how to think about memory and how to design memory. Overall, as you see, there are many options, but the core idea is better understanding what your agent should remember, how it should remember, and how it should forget. You can think about these three things when you're designing your own agent memory. This is still an evolving field, so you might see new features coming around memory overall. But I just wanted to show you the different design trade-offs and guide you toward the best option.
Finding the right balance between these techniques usually comes down to your specific use case. You can keep track of all the news and cookbooks in the resources section, and I'll also be uploading this demo application to our Build Hours GitHub. Thank you for your time, and thank you for listening to all of this. Have a great rest of your day, and we'll see you next time.