Video: Build Hour: Responses API | Duration: 3089s | Summary: Build Hour: Responses API | Chapters: Welcome to Responses API (7.68s), API Evolution Overview (97.325s), Response API Features (263.73502s), API Migration Example (453.81998s), Migrating Chat Applications (812.97s), Interactive Agent Demos (1055.64s), Handling Model Hallucinations (2199.715s), API Performance Comparison (2280.855s), Conversation Context Management (2398.935s), Open Source Availability (2553.235s), Prompt Caching Explained (2619.005s), Response API Tips (2706.71s), MCP Tool Functionality (2872.395s), Upcoming Build Hours (3031.395s)
Transcript for "Build Hour: Responses API": Alright. I think we are live. Hi, everyone. I'm Christine. Welcome back to another Build Hour. I'm on the start up marketing team, and today, I'm joined with Steve. Yeah. Hi. My name is Steve. I'm an engineer on the API team. Great. So today, we're talking all about the responses API. If this is your first build hour though, just a quick reminder that the goal of this hour is to empower you to build with OpenAI APIs and models, with live demos and also ready to use code repos. So I'll drop that link, in the chat over to the right side of your screen. And you can now find all of our upcoming build hours on our home page, as well as on YouTube. So we heard your feedback. You wanted these to be easily searchable. So if you go to the OpenAI YouTube channel, you can see a playlist with all seven of our last build hours, including g b d five, voice agents, codecs, and built in tools, and many more. So as I mentioned, this is the first build hour of that we're having right after dev day. We saw some really exciting product launches like the apps SDK and agent kit, and we saw this great theme around building agents, hence why we're talking about the responsive API today. So this wouldn't be a build hour without, a custom meme. And this just goes to show why the response API is so important, to talk about especially, as it relates to building agents. So here's what you can expect for today. First, we'll give you a quick brief on why the response is API. Then we'll show you with a live demo how we're gonna migrate to the response API if you haven't already. And then this demo is gonna be really fun. It's gonna kinda peel back the curtain on a day in the life of an OpenAI engineer. And then since this is the first build hour after dev day, we'll give you a little preview of agent kit before the October 29 build hour where we'll do a deeper dive. And then my favorite part is always the q and a. On the right side of your screen, you'll see our chat as well as the q and a function. So in the q and a function, you can submit questions. We have our team in the room who will be answering them, via chat as well as saving some to chat through live at the end. So over to you, Steve. Cool. Awesome. So, yeah, we wanted to talk a little bit today about sort of why we built the responses API, why we felt the need to sort of evolve our core API primitive from what was we previously had chat completions to something kind of brand new, which enables a lot of new functionality, and we hope fixes sort of a lot of the design paper cuts that people were experiencing with the chat completions API. So before I really get into it, I wanna start with a little bit of history. Back in, 2020, we launched our first API, v one completions. And this was really built for an era where models kind of just finished your thought. You had you have a prompt, and it would start exactly where you left off and continue until it was done or it ran out of tokens. And we had an API for this. It was called v one completions. And if you were an OpenAI builder in this era, you might remember the models of this time, GPT three, Text DaVinci, Ada, models like this were really sort of like the sort of frontier of LUNs at the time. But then in 2022, we launched chat GPT. And in the API, we launched, GPT 3.5 turbo. And this was the first model that was really post trained on a conversational format. So instead of picking up where you left off and just continuing, it was trained to respond to you in a way that's more like a conversational partner and a lot less like, like just a sentence finisher. Right? And this API, we famously designed it on a Friday and shipped on a Tuesday, but it really quickly became the defacto standard for LLM APIs. And soon afterward, we shipped features like tool calling and vision in the chat completions API that really helped level up as the models became more advanced. But starting earlier this year with the release of o one, o three, and now GPT five, We have these models that are very different. They're agentic and highly multimodal, and we needed an API that would enable everything from sort of simple text in and out requests to highly agentic long rollouts that could last for minutes at a time. So, I wanna talk a little bit or the yeah. So the response API really combines the simplicity of chat functions with the ability to do more agentic tasks. The response API can simplify workflows including tool use, code execution, and state management. And as model capabilities evolve, we hope that the responses API will be a flexible platform for building agentic applications. And, really, the core piece of this is a lot of the built in tools that we've that we've shipped. So if you used to be an assistance user, you might remember a couple of these from back in the day. We shipped file search and code interpreter and the assistance API, and we've brought these to responses API in, in addition to a bunch of new tools like web search, computer use, co remote MCP, ImageGen, and, of course, function calling that everyone knows and loves. And we believe the responses API will sort of be a, help us effectively enhance the OpenAI platform into the future as models evolve. Cool. So wanna talk about a few things that really set the responses API apart from chat commissions and what really make it different. The first is that the responses API at its core is what we call an agentic loop. So the core philosophy of responses is that, again, it's an agentic primitive. It needs to be able to do multiple things in the span of one API request. So, in comparison, chat completions, which sort of has a design where has n number of choices, one message each, this only allows us to sample for the model one time per request. But in responses, we can sample from the model multiple times. So what's an example of that? Let's say we want the model to be able to write some code and then use that code to be able to give us a final answer. So I can give the model, the access to the code interpreter tool, and I can say, what's the square root of 5,000,000,276? And then what the model can do is it can write some code. It can execute. We could then execute that code server side. We can show the model what the output of that code was, and then it can sample again to then give us a final answer that's based on what the code interpreter did it. So this is what I mean when I say in a a agentic loop. We can kind of do multiple things in a loop until the model finally says, hey. I'm done. Here's your final answer. Second big thing is this concept of items in and items out. So in response to the API, everything is called an item. And so what's an item? An item is this sort of union of types that represent things the model can do as well as things the model can say. So in, by comparison, the Jack inclusions API, everything was a message. Come concepts like function calling were kind of bolted on to the concept of a message. And so kind of handling these cases where the model is, like, doing something and and also, instead of saying something, we're a bit tough to reason about and the code didn't quite look amazing. In response to the API, we've brought we've broken these out as separate types. So a message is a type of item, a function call is a type of item, an MCP call is a type of item, and so on. And this makes it much easier to code around. So when you get these sort of multiple output items, it makes it really easy to sort of write a four loop and then a switch statement that allows you to do different things with items, like populate them in the UI, process them in your back end, or kind of do whatever your application desires. So I'm gonna show a quick example of what this looks like. I'm gonna pop over my terminal. And, on the left here, I'm gonna make a call to responses with g b t five nano, and the prompt is just semi joke. And then on the right, I'm gonna make a call to chat completions with the same prompt. So on the left, we can see that we have an output key, and it's got two items in it. So the first thing in here is a reasoning item. It's denoted by type reasoning. And, essentially, what it is, it's just a receipt that the model thought a little bit before it emitted the the joke, which is sort of our classic, why don't scientists trust Adams. And then on the right, in in the chat completions world, we don't have any receipt that the model did any reasoning. We have the design doesn't really allow us to hydrate these kinds of things from step to step. So we kind of just get this one message with content. If the model we're calling tools, that would sort of be represented in line here. So just a little bit example of what I mean when we say that the responses API is items in, items out. You can take these items and then pass them straight back into your next request, and everything will get sort of rehydrated. So moving on, the responses API is also purpose built for reason models. So responses API allows you to preserve reasoning from request to request. So in the previous example, we saw that the response, the responses API emitted a reasoning item. So if you were to call the responses API again and pass that same same item back, we would be able to rehydrate the chain of thought from the previous request and ensure that the model was able to see it and use it in in the subsequent request. So this works either statelessly or statefully. The responses API is stateful by default. So if you are just kind of using it out of the box and you're passing these items back, we're able to kind of rehydrate this chain of thought out of the database and then pass it back to the model. And what we see is this actually really boosts tool calling performance. For example, in our sort of primary tool calling eval, Taubench, we see a 5% performance increase on the responses API when compared to the chat completions API. Next thing is multimodal workflows. So we've really updated the design to make working with images and other kinds of multimodal content much easier. So if you want to do things with vision, it's really easy to pass base 64 or external URLs to the model with the responses API. We also added support for contact stuffing so you can pass files to the API PDFs. So, for example, if I have my PG and E bill and I say, why was my PG and E bill so high in September? I can just pass that PDF directly to the responses API, and we'll extract your content, show it to the model, and the model can help me figure out what was going on in my house that month. So which is a really cool way to sort of design these multimodal workflows. The next thing is we've really rethought streaming from the ground up. In chat completion's API, the chat the API emitted what we call object deltas, which is sort of a pattern where that forces you as a developer to sort of accumulate every event that comes out of the API and stack them all up to get a full picture at the end of what happened. The responses API, however, emits a finite number of strongly typed events that you don't have to sort of look at everyone to understand what happened. So if you're familiar with the response API, you might recognize a couple of the really common ones. So these are things like output text deltas if you just wanna see the, incremental token that the model is sort of emitting. You can also if you just wanna know when the response started and when it finished or if it failed, those are events you can listen for. And so you can only kind of you know, this is really easy to write a switch statement around, and, you know, the code is very nice to work with and and sort of easy to reason about. And we'll see an example of what this looks like in a little bit. And then the last point is that because we are able to sort of rehydrate context from request to request, we actually see that at p 50, these sort of long multi turn rollouts with the responses API that are where the model is calling multiple functions and then eventually giving you a final answer are actually 20% faster. And, also, they're less expensive because the model just has to emit fewer tokens. So why is that? In response to API, because the what the model will do is plan once and then call a function, and then you'll respond. And then we're able to rehydrate that original chain of thought, and then it can move straight on to the next function and so on. We're able to sort of preserve that chain so that it can sort of move quickly through the rollout and then finish. Jack completions where we have no way to preserve this chain of thought from request to request. The models are forced to think again at every step, which results in many more output tokens. It also results in worse a worse cache hit rate because we're actually dropping the chain of thought at every step, and so you sort of, like, lose that common prefix. So, those are just sort of a few reasons kind of, like, the things we've thought differently about in the response API. And so kind of at the end of the day, we really realized that developers need a way to simplify, how they build these agentic applications. And so I wanna look at how we're changing deployment with our agent platform. At the center is really the responses API and agents SDK. And these are sort of our core building blocks that allow you to build embeddable, customizable UIs in your application. So if you tuned in for dev day or if you were there in person, you saw that we launched agent builder and chat kit, which make it really easy to sort of build these workflows in your application and drop them in with just a little bit of work. And these are built on the response API. And we also see this as sort of the core of what we call the improvement flywheel. So using sort of your if you're using responses stately, you can sort of have build on a corpus of data that you already have to do things like distillation and reinforcement fine tuning. And you can also build evals for your tasks on top of this data, and it makes it really easy and sort of, puts all of this stuff right at the center. And then, you know, in conjunction with all this, we have all of our really great tools that enhance the model's abilities, things like web search, file search, and and so on. Cool. So I wanna show a little bit of a demo about how you might migrate to the responses API if you're still on chat competitions. We know that migrating APIs is not a super fun task, so I just wanna show how this can actually be really easy. So I'm gonna flip over to cursor, and here I have a sort of, like, really simple chat application that I built. It's just, you know, kind of like a worse looking ChatGPT. I can say, hey, you know, tell me a joke. And then if I flip back to cursor, we can kinda see that, we're using the chat completions API to power this. So, you know, if you ever built the chat application on our APIs before, a lot of this will look really familiar to you. You can you're just you're sort of using the built in SDK methods, and then this is just kind of like a single React file. So what we've done to make migrating your applications to chat to my responses really easily is built a sort of migration pack. So what this is is it's sort of a collection of prompts and guides that we use on top of codex, our CLI for agentic coding to actually go and migrate your app from one API to the other. And so the simplest way to get started is just to come in and kinda copy this bash commands. And then I'll get this kicked off, and then we'll kind of show, what this looks like. So I'm gonna paste this in, hit enter. It's gonna ask me what repository I wanna migrate. We're just gonna do the current one. Do you wanna migrate model references to g p t five? Sure. Why not? Branch name, sure looks good. And then do we wanna proceed with danger full access? And because it's a live demo, we obviously do. So what this is gonna do is it's going to kick off a run. It's gonna run codecs in kind of a headless mode here. And then we'll go back to the browser, and we'll kinda show some of the different things that we've kind of built into this pack to help codecs effectively migrate your your integration from one API together. So we've baked in a lot of these great prompts, things like, you know, sort of, the differences between so the the high level differences between the two APIs, some of the guardrails, some of the acceptance criteria. We're also providing it with some docs. So these are sort of migration notes, like, you know, common things where if you were familiar with one concept in chat completions, what does that concept look like in responses? Some of the formatting differences, content items, some of these sort of, like, philosophical changes that we've made that I talked about earlier. So it's kinda gonna feed all these things into codecs and allow it to just, like, really go and cook and then kind of come back to you when it's done. And even though this is a really simple application, that will should only take a few minutes, we are it actually scales really well to larger applications as well. The first few times I ran this, it took about ten minutes. So I'm actually gonna stop here, and then I'm gonna, take the fresh one out of the oven. Right? And I'm gonna say I'm just gonna switch to a branch where I've already finished this. And then we'll do a quick diff. And we can kinda see oops. That was backwards. Yeah. Cool. So we can kinda see what the what the, Codex CLI actually did. So it switched our, conversation mapping to input items instead of messages. It added a couple of extra fields here. It, of course, switched on the whole to g p t five. It included reasoning item encrypted content. So this is kind of what I was talking about earlier when I meant you can sort of rehydrate, chain of thought from request to request even if you're, you know, a ZDR customer or if you wanna work with the responses API statelessly. And then it's sort of changing the streaming handling, to work with the streaming the responses streaming events instead of the chat completions ones. So if I go back to my app, we get sort of, you know, very similar experience, but now we're built on the chat completions API with g p d five. So I can say, tell me to joke again, and the model will think for a little bit, and then, it will kinda come back and tell me tell me the joke. So, this is sort of like a really easy way to sort of at least get started on migrating your application to the responses API if you have a sort of a really deep chat conversations integration today. We, you know, we hope that the migration is very, very easy, but we wanna provide as many tools and guides to make it, as easy as possible because we really think there are a ton of benefits to migrating over. So with that, I wanna move on to my next demo, which is sort of a, sort of, wanted to kind of talk about, like, a little game that we made and sort of, like, how we can use the responses API to really add some, like, cool agentic capabilities to this game. So, you know, I over the weekend, I built this sort of little game calling it OpenAI simulator. It sort of simulates a day of the life of an OpenAI engineer. And I won't tell you exactly how long I spent building this map. I'm a little little embarrassed too, but it's a pretty faithful representation of sort of, like, what our floor, through the API floor at OpenAI looks like, and there's a bunch of characters. And there's two kind of, like, main characters that we want to be able to interact with. We have, Wendy Jay, who is an engineer on my team, and she built some of the great tools that you know and love, like ImageGen and file search. And then, of course, we have Sam Sam Altman, CEO of OpenAI. He's really, really interested in sort of, like, building AGI and helping his employees get there and help guide them to success. So let's look over back to Cursor, and we'll look at sort of, like, how this is configured, and we'll walk through some of some of the code here. So, we have two agents. One is Sam, and one is Wendy. We have some request options here. So we are feeding them both the model, which is g p t five. Sam has some pretty basic instructions. Wendy has some basic instructions. And these are just sort of things like, you know, a little bit of backstory, how you should act, how you should behave. And if we go and back to our game and say, hey, Sam. What's on the critical path to AGI? Sam will think for a little bit, and then he will eventually respond to us. If we pop open our developer tools here, we can kinda see the different streaming, events that are streamed back to us. So we have spots dot create at the start, in progress, output item added. But the problem with this is that it takes a little bit of time for Sam to actually respond to us. And before he, you know, starts talking about frontier model architecture, agents with the memory, we're kind of just left hanging, and it's not that exciting of an experience. So what we wanna do is really is kinda give Sam the ability to, emit his sort of reasoning summary so we can kinda see what Sam's thinking before he actually starts talking. So let's go back to our code, and we'll update our code here. And we'll say, we'll give add the reasoning block, and we'll say, efforts is medium and summary is auto. And what this will do is this enables a reasoning summarizer in the API, which basically looks at the chain of thought that's coming out of the model, decides if it's worth summarizing. And if it is, it will start a sideline sampling process to summarize the chain of thought in a way that's consumable for a user and then start streaming that back. So if we go over to the right, we can see our four loop where we're kind of handling our different streaming events. I'll minimize this one to start, but we kind of have just three handlers right now. We're just looking to see when output items are done, and we'll kinda get back to this stuff in a second. We wanna see when we get text deltas. So this is sort of our final message from the model. When it emits a new token, we basically wanna admit that to the UI, and we're just saying it's type text to give it that sort of visual treatment. And then we just have another one here so we can log the final response when it's when it's completed. So let's go ahead and add a new case here. We'll say case response dot reasoning summary part delta, and then we will change this to reasoning just to give it a different visual treatment. And then we'll go back to our game, and we'll say, hey, Sam. What do you think is on that critical path to building AGI? Sam's gonna think. And in the background, he's gonna emit some reasoning tokens, and, hopefully, we'll start summarizing those tokens and start to see the reasoning summary, in a second. Okay. Cool. So, yeah. So he's thinking about the critical path priorities. Are are you guys not amazing here? So because there's only a little bit of chain of thought to summarize, he summarizes we get a little bit of that summary, and then he kind of launches right into his his final answer. So, anyway, a little preview of how you can use reasoning summaries to make your UIs a little bit more interactive while you're waiting for the model to think, especially if you're using, GPT five. So, no good agent is complete without sort of tools and things that it can do in the real world. And because Sam is so focused on building AGI, we wanted to make sure he was sort of guided and kind of knew what stuff to do. And so we created a linear board that has some of the tasks that you might think are on the critical path to AGI. So things like a memory leak when AGI role plays as a toaster for too long or support for writing breakup texts or please recognition, you know, just some of the basic stuff. And we wanna give Sam access to this linear board so that he can pull from this and then give me things to do in the game. So let's go back to our, IDE, and we'll go ahead and add tools here. We'll give him access to this MCP tool. And what this is is it's just a simple tool definition kind of defining how we want our, API servers to connect to this MCP server. So we give it a type. Type is MCP. But we give it a server label. So this is sort of how the functions in the MCP server are namespaced. We give it a server URL, little description, of course, and authorization token that identifies me as being the owner of this project that we just looked at. And then we can give it some allowed tools. So these are things like, get issue, list issue, create issue, and these are things that we this is sort of an allow list of tools that we want the model to be able to call. There are many, many tools, but, you know, we we don't always trust our agents to be able to work in a way where without supervision. And so sometimes you wanna be able to sort of limit the tools they can call. And then we'll say, always require approval. And then over on the right side, we'll look at some of the code we already wrote to handle this. So, what we see here is we're looking at our output item done event. And if the type of item that we got back is an MCP approval request, we're just gonna pop open a window to confirm, yes, I want to run that that task, and then we'll say, we'll basically just auto approve it and then keep keep sampling. So let's go back. Actually, what we wanna do is add a couple of more couple more, items in our switch statement here for our streaming events. So we'll say response. M c p list tools in progress. We'll emit something to the UI that looks like, listing tools. Great. And then we want to add another one here so that we can omit another event when we're actually calling tool. So let's say if, events dot item dot type is MCP call that we want to, again, emit an event to our UI that just describes what tool is being called. So we'll say, calling, and then we can say our print dot item dot server label dot, events dot item dot name. So we're gonna print out the name of the function that the model is calling. And we wanna add, we added this so that we can kinda know what the API is doing in the back end. So when you first start out when you first make a request to responses with MCP enabled, the first thing we do is list the tools that the server exposes. And this is because MCP servers can be dynamic. The tools can change from request to request and depending on what level of authorization you have. So if I'm a limited privilege user, I might have access to fewer tools than a sort of more privileged admin user. So we just wanna know sort of when this is happening to keep our UI really fresh and and then know what's going on. So let's make sure we save that, and we will add some specific instructions to Sam, kind of instructing him how to use the MCP server. We'll go back to our game, and we'll go down to Sam. And we'll say, hey, Sam. I'm really excited to work on AGI. Can you list a few of the issues in the AGI Linear board I might be able to work on. Seems good thing. So we'll see that he's listing tools. Just gonna think a little bit about sort of, like, how to go about, sort of, like, fetching stuff. We're gonna get a little pop up here that says run list issues and say yes. Sam is thinking again. So he's gonna, hopefully, run this tool on a second. So he's calling linear MCP server list issues. A little hard to see, but you might go see right down there. He's got a list of whimsical issues. I think they're actually pretty serious, but that's fine. Okay. Cool. So he's able to pull from the board. He said, yep. There's a issue with AGI coordinating, Skyrim MPC unions. There's a memory leak for the toaster thing, police recognition, etcetera. And I'm gonna say, hey, Sam. I'm, actually really interested in making sure AGI can tell the difference between, between MacBooks and Windows PCs. Can you add a task in the board for this. And Sam's gonna think again. He's kind of because the whole thing is kinda covered up by the existing text, but we should get a prompt to, she might we'll probably think a little bit, and then we should get a prompt for him to actually create an issue in our board. And so he will be able to sort of interact with this board kind of in real time. So he's thinking about how to create a concise description for the project, So how to distinguish MacBooks versus Windows PCs. He's thinking about how to let me know. So he's gonna wants to run create issue, and I'll say okay. So now he's calling, linear m c p server dot create issue. And then when he's done, then we should summarize sort of what he did and then tell us kind of, what he did. So okay. Great. He was able to create OpenAI 32 distinguished MacBooks versus Windows PCs. And if we come over to our, our linear board here, we see that he actually was able to do this. And if we click in, he's actually given us a pretty thorough description of exactly how to do this. So, let's say that's that's great. I can go off and work on this. So, this is sort of an example of how you can bring really rich information from other parts of the Internet, other, you know, services that you know and love into your applications using something like MCP. And this is as first class support sort of in the response to API as we saw. So we also wanna show a little bit about how we do these sort of, like, multi turn, multi tool rollouts, sort of in one API request. So this is the ability for the model to kinda go out and do multiple things and then finally come back and give you a final answer. And to do that, we'll talk to our other sort of character in the game, Wendy. So let's go back to our code, and we'll give Wendy access to a couple tools and also copy some of these settings over here. We'll give Wendy access to two tools with the access to the web search tool and the image generation tool. And these are two cool tools that allow you to sort of search log the model to search the Internet and then also, sort of have access to the great image gen model that, everybody everybody loves so much. So, we'll give her access to these two tools, web search. This one's pretty simple. You can also add things here like where the user's based, their city, their state, their time zone if you want really sort of localized results. And we'll go over access to the image generation tool. So we're using our GPT image one model, small image, small square image, and low quality just for speed. And then we'll save this. And, oh, actually, we probably wanna know sort of, like, when these two calls are happening. So we'll add a couple of more case statements here to our switch statements. So we'll say caseresponse.websearchcall.searching. And what we wanna do, I'm gonna copy some stuff. Let me say searching the web. Cool. And then we do another one for generating image. So we'll say caseresponse.image generation call in progress, and we'll do a similar thing here. And we're just using the reasoning type here to give it sort of that blue visual treatment in the UI. Say generating image and save. Save over there. We'll go back to our game, and we'll go down and talk to Wendy. Let's say, hey, Wendy. I've never seen French bulldog before. Can you search the web to find out what they look like and then draw me a picture of it. Cool. So Wendy's gonna think a little bit. And what I wanna do is sort of pop open our developer tools, and we can kind of, like, we can kind of see what the streaming events that are happening as they proceed. So Wendy is thinking about image generation tools. She's, you know, kind of thinking and planning about what she wants to do. So she is going to keep it to three to four queries, which should hopefully be enough. She's searching the web. So back in our dev tools, we can see records of this. We can see we have a output item done. We have a a reasoning call here. We have web search call in progress, web search call searching. So we kinda get these state machine events that tell us sort of, like, what, amount or, essentially, what is happening with the tool call at any given time. Web search call completed. And then if we click into one of these events that represents the web search call, we can actually see what the model searched for. So it's looking for AKC. I think that's American Kennel Club, French bulldog breed standard appearance, ears, compact size colors, blah blah blah. And so we can see that Wendy's actually gonna do this a few times and, you know, really gather all the things she needs to kinda give me an accurate picture of what a Frenchie looks like. And then, at at the end, she should hopefully, be able to to show this to me. So she's gonna search through it multiple times. We added some code in our handler here. So we're looking for when an output item is done. And, specifically, we're looking for image generation calls, and those calls sort of have the base 64 data representing the image, sort of stream right back to you. And what we're gonna do, this the code we've already written will sort of open that image just in a new tab when the, full call is done. So let's give Wendy a sec to sort of, like, think about all of what needs to happen. And then, okay. So great. She's starting to think about the image prompts. Okay. This is great. And she's go ahead and drawed me a photo of a French bulldog, which is very cute. So, yeah, this is just sort of, like, a a really brief example of, how the responses API can sort of really level up your applications, help you take advantage of multiple hosted tools, information from outside your application, and really bring your sort of, like, characters in your apps to life. So, Yeah. I think the last thing that I wanna show is actually how sort of a preview of our, agent builder product. So if you tuned in to the and tuned in to dev day or if you are at dev day in person, you saw, Christina sort of give an overview of how to do things with our agent builder product. And so I just wanna show a simple example of how you might recreate some of what we just built in agent builder. So what we have here is a really simple workflow, and I'll kinda click through the nodes and explain what's happening. So the first node we have here is a sort of, like, a web search sort of decision agent. And so we have some of the same backstory here. And, basically, the instructions are to just decide if the query that's being asked requires a web search. And if so, we want to emit some structured data saying, like, yes. This this user wants to search the web, and then if so, what the search queries. And And then we have this failed statement, which essentially looks at the output from the last node and will either then call this agent, which has the web search tool enabled, and we've gone ahead and set some of these localized settings here, or it will call a different agent that is whose job is just to kind of respond conversationally. So we can go ahead and try this out. We'll click preview, and we'll say, hey. What's the weather near me? We can kinda see the nodes activate as they, as they sort of we go through the workflow. So we can see the first one classified my query as wanting web search and the the query being currently near me. If failed statement passed, and then we move on to this agent that I can actually search the web. It knows where I'm located and could tell me that, it's actually a pretty warm day in San Francisco. All things considered 67 degrees or 20, you know, Celsius for, folks not in The US. But then we can go back and ask, like, hey. Actually, can you just tell me a joke? And we can sort of, like, watch the inverse happen. So this, the decision node here should return false, and then we kind of, like, flow into the conversational agent, and we just get a few jokes here. Including the kid cat adds one. I haven't seen that one. That's good. Okay. Cool. So this is a little preview of sort of what our agent builder product looks like. The next build hours will go really, really deep on this. But, this product will be really cool for sort of build a view, build these drag and drop workflows, and drop them right into your applications without the need to write so much code. So with that, I think we can go to q and a. Awesome. Yeah. So you can just do a quick refresh. Refresh here. Okay. Okay. Cool. Okay. Cool. So this person asks, what's the best way to pass example outputs to the model? I'm using g p t five minutei, and I wanna return a structured JSON, but also often find that it can hallucinate. This is interesting. So I we find that, we find that people really have a lot of success with few shot prompting the model. And even though this is sort of a technique that dates way back to sort of the first generation of these chat models, It still works really well for the current generation of models too. So if you want to really give the model really clear instructions on how it should behave when you're finding that it hallucinates, you might wanna give it a few different varied examples. So you say, you know, this is my user message. This is the input. This was the assistant message. It's sort of a good canonical example of what you would want it to say. And you give it a few different examples of this, so it's varied. And so this will kinda give the model, like, a really clear idea of what you want to, what you actually want it to what you actually want it to do in that scenario. So we recommend trying few shot prompting. And if you need if you find that the hallucinations are kind of related to things where it's making up data that it might be able to find from outside its own context, you might try adding in tools like web search to be able to bring in data from the outside web to kind of, make sure that it doesn't just sort of make something up. Great question. Cool. Are there performance differences between the chat completions and responses API? Yeah. This is a great question. So we find that the responses API, again, really, really thrives in these, like, long sort of tool calling rollouts where the model is gonna kinda think for a while and then call a bunch of functions in a row, we find that the end to end performance of these rollouts over many requests is actually a lot shorter, than what it would be in chat completions where the model has to think again between every step because it can't preserve its reasoning from the very first step. The response is the model can think for a little bit and then call a tool, and then you can respond. And then when you respond, we can rehydrate that original thinking content. And then the model knows, oh, okay. Like, that's what my plan was, and I can just proceed the next tool call. Whereas in chat completions, we have no way to preserve that original content, so it gets dropped. The model has to think again before it can continue. And so this you know, the process of thinking is, you know, emitting more tokens, which takes time. And so we find that, by not having to do that, we actually, at median, save about 20% of time. And it's also a little bit cheaper, and you get better cash at rates. We've also, found that, the response API is a little bit better at sort of, well, it kind of enables a stateful query. So if you have, if you have sort of things like we were previously doing had a a tool, like a function call that did something that are one of our hosted tools can, implement. Let's say you have, like, a rag function, so you wanna look something up from a corpus of files. You know, having to round trip that so you get the function call that says, okay. Search the file, then you go do something in your own server, and then you have to send it back to the API. That round trip incurred a little bit of additional latency where the hosted tools and the responses API could have all happened in a tight loop, and, we save a little bit of the time there. And, of course, our retrieval stack is is really is really finely tuned. So a couple examples of where the responses API, you might get sort of better performance than than chat completions. Okay. Cool. How are previous items passed to a new response request? Is this a conversation with this doc's mission or something else? I'm interested in rehydrating my chain of thought over many requests. Okay. This is a really good question, and we actually talk didn't touch on a couple of different ways to do this. There are actually a few. So the most simple one is the one that you're probably familiar with if you've used chat completions in the past. And this is just taking the whole list of items that kind of represent your conversation and then just passing it to the next request. You can just keep appending things to that list and then passing it back. This is kind of how you would have used chat completions in the past. If you don't wanna do that, you we also have sort of a helper called previous response ID. So this allows you to sort of chain off of a previous response and just add one or two incremental input items. And what that will do is it will load the previous response, fetch all the, context from that, and then append to the things that you passed in and then continue from there. So it's kind of an easy way if you just wanna, like, keep a pointer to the sort of, like, head of the conversation without having to sort of manage the context set yourself. A couple months ago, we also launched the conversations API. So if you were an assistance user before, you were probably familiar with the concept of the thread object. The conversation object is sort of the reimagination of that, for the responses API. So you can create a conversation by doing post view on conversations. You get a update feedback. You can pass that to the responses API. And then as the conversation progresses, you just pass incremental input items to the to the your call to be one responses. So you might say, conversation ID is this, and then input items are, a and b. And then we will take those, items and then append them to the conversation so the conversation grows and mutates over time. And then you just really have to keep a reference to the one conversation object you created. It's really great if you're sort of building a chat UI, and you can just list the things in that conversation if you wanna render a UI out of it. It's sort of like a really easy way. So it's kind of a few different ways to rehydrate context from request to request depending on sort of your specific application and sort of, like, how much you want to actually manage the context yourself. And, again, if you're using responses statefully, so if you're using the conversation object and things are being persisted on our servers, the sort of chain of thought rehydration is happening automatically. If you are using it statelessly, so you're using the store parameter and setting it to false, then you can include the encrypted content sort of round trip that, round trip that encrypted reasoning so that you, don't have to so you sort of get the benefits of rehydrating the chain of thought. Or if you're a ZDR customer, same thing. You can kind of use that include gram to, be able to rehydrate that stuff. Cool. Will the template be available as OSS? I think maybe the if the template I think all the stuff we showed today, we'll we'll all put on the, I'll put on the, sort of, our build hours GitHub. So all that stuff should be available. So, yes, hopefully, anything that you saw today will be available on GitHub. Cool. I'm working on a paramedic simulator using multiple agents and roles, patient, bystander, dispatcher, etcetera. Any chance that you are posting this code, it seems like a great jump start. Yes. Exactly. This game that we put together will be available on the build hours GitHub, so you can feel free to to fork it and kind of do whatever you like. I added those because it sounded like people were really, interested Yeah. And and, also playing a day in the life. Nice. Yeah. Exactly. Yeah. It's pretty fun, I would say. Yeah. It's a good time. Yeah. You can make it your own. Exactly. I know we have some time left. So if you wanna do a quick refresh, I just submitted two more questions. Awesome. Okay. Cool. How does prompt caching work, and how can I edit my prompts to take advantage of prompt caching? So, yeah, it's pretty straightforward. Basically, the way that it works is once you pass in all of your context, we sort of construct, sort of underlying representation in tokens, and then we take all those tokens, and then we send those to the model. And at the model level, there is sort of a cache that will basically try to match a certain amount of, basically, a basically do a kind of a prefix match on the tokens you passed in with what may already be in the cache. So let's say you, made a call to responses API and you said, hey. Tell me a joke. Well, we will tokenize that conversation, send it to the model, and if we find that exact prefix in the, sort of model's cache, then those will count as cache tokens, and you you will pay a discounted price for those input tokens. So the way to really take advantage of prompt caching is to sort of not change the earlier parts of your context in between requests. Because if you remove something, let's say, three items up, then, essentially, you will, change the prefix. And so anything that came comes after that will be lost because the prefix will be a little bit different. The this sort of set of tokens will be different. So the way to sort of, take advantage of the cache most effectively is to treat your contacts as sort of an appendal new list and just keep a kind of do it if you sort of change things earlier in the, earlier in the flow, then you'll kind of, like, lose out on a little bit of the benefit of of prompt cashing. Cool. What are some of the most common mistakes you see with using the response API? This is a good one. We definitely see that, you know, we I think that the the biggest thing that the response idea offers is sort of this advantage of being able to rehydrate chain of thought from request to request. And so we see sort of a common pitfall for sort of ZDR customers that are stateless by default. You know, you there's sort of an the additional opt in step of requesting the encrypted content so that you can rehydrate that and, sort of pass it back. And so so if you don't do that, you know, you won't sort of, like, by default, you won't be able to take advantage of that. So definitely recommend if you're using responses statelessly or if you're a ZDR customer to always, sort of request that encrypted content so that you can pass it back and take advantage of the full abilities of the reason models. Another thing we see is I'm trying to think. Another thing we see is definitely, folks trying to sort of, like, roll their own versions of hosted tool that we have. We think that we offer some, like, really, really great host of tools, and they make it very easy to get started, especially if you're doing something like building a, a rag pipeline where you want the model to be able to search over a corpus of documents. This is a a really hard thing to really do well, and our team has spent a ton of time on getting it right. And so definitely recommend where possible, especially if you're just getting started and you don't need so many knobs. So sort of take advantage of the built in tools and, you know, really get all the power out of the platform to really get yourself ramped up quickly. The other thing that we really wanna try to encourage people to use to do is try out some of the other sort of, objects in the responses API ecosystem. So talked about the conversation object a little bit. This is a really easy way to get started, especially if you're building sort of a chat interface. The other really great object that we shipped a few months ago is the prompt object. So if you have a task in your API, let's say or in your application, let's say, translate this or, you know, this is sort of the outline for your character and maybe the game example that we went through. You can create a prompt object in the dashboard that is sort of a stateful object that describes you can give it instructions. You can say, you know, you are character x y z or your job is to translate the content below from English to Spanish. You give it some tools. And this basically allows you to sort of define a task and then, sort of just reference that in the responses API with the prompt ID. And then anything, you can version these prompts and iterate on them, and it really helps you if you'll climb on your tasks. Maybe you have to have an eval for this specific thing and you wanna make it better. You can just kind of change the prompts, save it, take the new version, drop it in the responses API, and sort of all your calls will get all the benefits of that without having to hard code all that stuff into your application. So few common pitfalls, few ideas if you've sort of never tried those, those objects before. Awesome. These are some of our resources. Our q and a is going off. Do you have time for one last question? Yeah. This one is about, would you explain how and when the MCP tool calls happen in the response API and what you have seen folks doing with that functionality? Totally. Yeah. So, MCP is really cool. We're, like, we're really excited to support in the responses API. Basically, the way that it works is if you enable an MCP server in responses, the first thing we do is we'll go out and reach out to the MCP server and just say, hey. What are the tools you have available for my user to use? And it will basically return if you've used function calling, it's very, very analogous to function calling. We kind of refer to it as, like, remote function calling almost. And so it returns a bunch of function definitions, essentially, and we feed those into the model. And we say, here's all the list of functions that you have, and they're all named space. So we can we can tell sort of the difference between a function that you provided that you want us to yield back control to you to handle versus an MCP function that we need to reach back out to that server to execute. So in the linear example, there's a bunch of functions. There's, like, 40 functions. We only looked at a few of them. But the sort of canonical example would be we make a request. The responses API will call out to the MCT server to say, here's my user, off token. What functions does this user have available? It will return this list. We'll show that to the model. Depending on my prompt, if I say, like, hey. Create an issue in x y z board for, you know, this description, then the model can identify the right tool. It will emit some JSON that says, like, here's the tool I wanna call and here are the arguments. And then we'll take that, and then we will send that back to the MCP server. So we're reaching back out to linear with that with that, JSON, and the linear MCP server will do something with it. In this case, maybe create an issue in the project, and then it will return an acknowledgment back to us saying, like, okay. Great. Here's a representation of the issue I created. The model can then look at that and then summarize sort of what it did in its final answer. So, yeah, that's a little bit about, you know, how MCP was working under the hood. Some cool things we've seen doing people do with MCP are building, you know, people really like, to give sort of their agentic coding tools like CodexCLI access to, you know, other parts, other things that, they might have to pull from. So, again, linear is a great example. Let's say I spin up codecs and I say, like, hey. Can you just pull, this issue from linear? It's already got the stuff in it, or I want you to work for these five issues. It can sort of of work with linear to sort of pull a few things and then work on it independently without you having to stop and prompt it every few every few turns. So, with codec CLI built on responses API, and then you kind of get all that cool functionality by by virtue of that. Awesome. Thanks so much, Steve. This was really helpful. Love seeing all of the questions come in. We're trying to answer them all in the chat, as well as live. Mhmm. But if you didn't get your questions answered this time around, we have more build hours. So I'm gonna just hit the next slide. Awesome. So October 29 is all about agent kit. It's really a deeper dive of what you've seen today. So bring those questions there. If you have more questions around the responses API, we'll get to them there as well. Then November 5 is all about agent RFT. So once you've built your agent, how do you make them better? And then December 3 is about agent memory patterns. Super excited for that one. All of the build hours are available on YouTube as well as on demand on our home page, and that's where you can sign up for all the other future build hours. And with that, we'll wrap up. So thanks so much for attending, and we'll see you at the next one.