In my previous post, I covered what Prompt Caching is, how it works, and how it can save you a lot of money and time when running AI-powered apps with high traffic. In today’s post, I walk you through implementing Prompt Caching specifically with OpenAI’s API, and we discuss some common pitfalls.
Before getting our hands dirty, let’s briefly revisit what Prompt Caching actually is. Prompt Caching is a functionality provided by frontier model API services, like the OpenAI API or Claude’s API, that allows caching and reusing parts of the LLM’s input that are repeated frequently. Such repeated parts may be system prompts or instructions that are passed to the model on every run of an AI app, alongside variable content like the user’s query or information retrieved from a knowledge base. To get a cache hit, the repeated part of the prompt must sit at its very beginning, namely, it must be a prompt prefix. In addition, for prompt caching to be activated, this prefix must exceed a certain threshold (e.g., for OpenAI the prefix should be more than 1,024 tokens, while Claude has different minimum cache lengths for different models). As long as those two conditions are satisfied (repeated tokens forming a prefix that exceeds the size threshold defined by the API service and model), caching can be activated to achieve economies of scale when running AI apps.
Unlike caching in other components of a RAG or other AI app, prompt caching operates at the token level, inside the LLM’s internal computations. In particular, LLM inference takes place in two steps:
- Pre-fill: the model processes the entire input prompt in parallel, computing the internal representations (attention keys and values) for every input token.
- Decoding: the model generates the output token by token, with each iteration conditioning on everything produced so far.
In short, prompt caching stores the computations that take place in the pre-fill stage, so the model doesn’t need to recompute them when the same prefix reappears. Any computations taking place in the decoding phase, even if repeated, aren’t going to be cached.
For the rest of the post, I will be focusing solely on the use of prompt caching in the OpenAI API.
In OpenAI’s API, prompt caching was initially introduced on the 1st of October 2024. Originally, it offered a 50% discount on cached tokens, but nowadays this discount goes up to 90%. On top of this, hitting the prompt cache can also cut latency by up to 80%.
When prompt caching is activated, the API service attempts to hit the cache for a submitted request by routing the prompt to an appropriate machine, where the respective cache is expected to exist. This is called cache routing, and to do it, the API service typically uses a hash of roughly the first 256 tokens of the prompt.
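To build intuition for what cache routing does, here is a toy sketch of the idea: hash the start of the prompt and use the hash to pick a machine, so requests sharing a prefix land where the cached computation lives. This is purely illustrative, not OpenAI’s actual implementation, and it approximates the first 256 tokens with a fixed number of characters:

```python
import hashlib

def route_request(prompt: str, num_machines: int = 8) -> int:
    """Toy model of prefix-based cache routing: hash the beginning of the
    prompt and map it to a machine index, so prompts sharing a prefix are
    routed to the same machine (where the cached pre-fill lives)."""
    # Roughly approximate "the first 256 tokens" with the first 1,024 characters.
    prefix = prompt[:1024]
    digest = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_machines

# Two prompts sharing the same long prefix land on the same machine.
shared = "You are a helpful assistant. " * 100
print(route_request(shared + "What is overfitting?"))
print(route_request(shared + "What is regularization?"))
```

Because only the opening characters feed the hash, the two requests above map to the same machine even though their questions differ.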
Beyond this, the API also allows explicitly setting the prompt_cache_key parameter in the request. This is a single key identifying which cache we are referring to, further increasing the chances of our prompt being routed to the correct machine and hitting the cache.
In addition, the OpenAI API provides two distinct types of caching with regard to duration, controlled through the prompt_cache_retention parameter:
- In-memory retention (the default): cached prefixes typically persist for a few minutes of inactivity, and up to about an hour.
- Extended, 24-hour retention: cached prefixes are kept for up to 24 hours.
Now, as far as pricing goes, OpenAI charges the same per (non-cached) input token whether we have prompt caching activated or not. If we manage to hit the cache successfully, we are billed for the cached tokens at a greatly discounted price, with the discount reaching up to 90%. Moreover, the price per input token remains the same for both in-memory and extended cache retention.
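To make the billing concrete, here is a quick back-of-the-envelope calculator. The price and discount values below are illustrative placeholders, not official OpenAI rates:

```python
def request_cost(input_tokens: int, cached_tokens: int,
                 price_per_1m: float = 0.40, cache_discount: float = 0.90) -> float:
    """Cost of one request in dollars, assuming cached tokens are billed
    at (1 - cache_discount) times the normal input price. The default
    price and discount are placeholders, not official rates."""
    uncached = input_tokens - cached_tokens
    full_rate = price_per_1m / 1_000_000
    return uncached * full_rate + cached_tokens * full_rate * (1 - cache_discount)

# A 4,616-token request: cache miss vs a hit on a ~4,500-token prefix.
miss = request_cost(4616, cached_tokens=0)
hit = request_cost(4616, cached_tokens=4500)
print(f"miss: ${miss:.6f}  hit: ${hit:.6f}  saved: {1 - hit / miss:.0%}")
```

With a 90% discount on nearly the whole prompt, the per-request input cost drops to a small fraction of the cache-miss cost, which is exactly the economies of scale mentioned above.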
So, let’s see how prompt caching actually works with a simple Python example using OpenAI’s API service. More specifically, we are going to walk through a realistic scenario where a long system prompt (prefix) is reused across multiple requests. If you are here, I assume you already have your OpenAI API key in place and have installed the required libraries. So, the first thing to do is to import the OpenAI library, as well as time for capturing latency, and initialize an instance of the OpenAI client:
from openai import OpenAI
import time
client = OpenAI(api_key="your_api_key_here")
Then we can define our prefix (the tokens that are going to be repeated and that we aim to cache):
long_prefix = """
You are a highly knowledgeable assistant specialized in machine learning.
Answer questions with detailed, structured explanations, including examples when relevant.
""" * 200
Notice how we artificially increase the length (multiplying by 200) to make sure the 1,024-token caching threshold is met. We also set up a timer to measure our latency savings, and then we are ready to make our call:
start = time.time()
response1 = client.responses.create(
    model="gpt-4.1-mini",
    input=long_prefix + "What is overfitting in machine learning?"
)
end = time.time()
print("First response time:", round(end - start, 2), "seconds")
print(response1.output[0].content[0].text)

So, what do we expect to happen here? For gpt-4o and newer models, prompt caching is activated by default, and since our 4,616 input tokens are well above the 1,024-token prefix threshold, we are good to go. This first request checks whether the input is a cache hit (it is not, since this is the first request with this prefix), processes the entire input, and then caches it. The next time we send an input whose initial tokens match the cached input, we get a cache hit. Let’s check this in practice by making a second request with the same prefix:
start = time.time()
response2 = client.responses.create(
    model="gpt-4.1-mini",
    input=long_prefix + "What is regularization?"
)
end = time.time()
print("Second response time:", round(end - start, 2), "seconds")
print(response2.output[0].content[0].text)

Indeed! The second request runs significantly faster (15.37 vs 23.31 seconds). This is because the model has already performed the computations for the cached prefix and only needs to process the new part, “What is regularization?”, from scratch. As a result, by using prompt caching, we get significantly lower latency and reduced cost, since cached tokens are discounted.
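Latency alone is a noisy signal, so it is worth confirming the hit from the usage stats of the response itself. In recent versions of the Responses API, the usage object reports how many input tokens were served from cache; the exact field names below may vary across SDK versions:

```python
def cached_fraction(usage) -> float:
    """Fraction of input tokens served from the prompt cache, given a
    Responses API usage object. Recent SDK versions expose
    usage.input_tokens and usage.input_tokens_details.cached_tokens;
    field names may differ in other versions."""
    cached = usage.input_tokens_details.cached_tokens
    return cached / usage.input_tokens if usage.input_tokens else 0.0

# After the second request above, something like:
# print(f"{cached_fraction(response2.usage):.0%} of input tokens were cached")
```

A non-zero cached token count on the second request is direct evidence that the prefix was reused, independently of any timing noise.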
Another thing we’ve already mentioned from the OpenAI documentation is the prompt_cache_key parameter. According to the documentation, we can explicitly define a prompt cache key when making a request, and in this way specify which requests should use the same cache. Nonetheless, when I tried to include it in my example by adjusting the request parameters accordingly, I didn’t have much luck:
response1 = client.responses.create(
    prompt_cache_key="prompt_cache_test1",
    model="gpt-5.1",
    input=long_prefix + "What is overfitting in machine learning?"
)

🤔
It seems that while prompt_cache_key exists among the API’s capabilities, it is not yet exposed in the Python SDK (at least in the version I used). In other words, we cannot explicitly control cache reuse this way; caching remains automatic and best-effort.
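One possible workaround, sketched below and not guaranteed to work with every SDK or API version: the openai Python library accepts an extra_body argument that forwards arbitrary fields in the request payload, which may let you pass prompt_cache_key even when it isn’t a first-class keyword argument:

```python
# Assumes the client and long_prefix defined earlier. extra_body forwards
# arbitrary fields in the request payload, which can help when a parameter
# exists server-side but isn't yet a keyword argument in the installed SDK.
# Whether the server honors prompt_cache_key this way still depends on
# API-side support.
response1 = client.responses.create(
    model="gpt-4.1-mini",
    input=long_prefix + "What is overfitting in machine learning?",
    extra_body={"prompt_cache_key": "prompt_cache_test1"},
)
```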
Activating prompt caching and actually hitting the cache seems to be kind of straightforward from what we’ve said so far. So, what may go wrong, resulting in us missing the cache? Unfortunately, a lot of things. As straightforward as it is, prompt caching requires a lot of different assumptions to be in place. Missing even one of those prerequisites is going to result in a cache miss. But let’s take a better look!
One obvious miss is having a prefix that is less than the threshold for activating prompt caching, namely, less than 1,024 tokens. Nonetheless, this is very easily solvable — we can always just artificially increase the prefix token count by simply multiplying by an appropriate value, as shown in the example above.
Another thing would be silently breaking the prefix. In particular, even when we use persistent instructions and system prompts of appropriate size across all requests, we must be exceptionally careful not to break the prefixes by adding any variable content at the beginning of the model’s input, before the prefix. That is a guaranteed way to break the cache, no matter how long and repeated the following prefix is. Usual suspects for falling into this pitfall are dynamic data, for instance, appending the user ID or timestamps at the beginning of the prompt. Thus, a best practice to follow across all AI app development is that any dynamic content should always be appended at the end of the prompt — never at the beginning.
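As a minimal sketch of this best practice (the helper and field names here are hypothetical), dynamic values go after the static prefix, never before it:

```python
import time

def build_prompt(prefix: str, question: str, user_id: str) -> str:
    """Cache-friendly prompt assembly (hypothetical helper): the static
    prefix stays at the very start so its tokens remain cacheable, and
    anything request-specific is appended at the end."""
    dynamic = f"\n[user_id={user_id} ts={int(time.time())}]"
    return prefix + question + dynamic

prefix = "You are a helpful assistant. " * 100
p1 = build_prompt(prefix, "What is overfitting?", "u1")
p2 = build_prompt(prefix, "What is regularization?", "u2")
# Both prompts open with identical tokens, so the prefix stays cacheable.
print(p1.startswith(prefix) and p2.startswith(prefix))
```

Had the user ID or timestamp been prepended instead, the opening tokens would differ on every request and no two prompts would ever share a cacheable prefix.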
Finally, it is worth highlighting that prompt caching only concerns the pre-fill phase; decoding is never cached. This means that even if we instruct the model to generate responses following a specific template that begins with certain fixed tokens, those tokens aren’t going to be cached, and we are going to be billed for their generation as usual.
Conversely, for specific use cases, it doesn’t really make sense to use prompt caching. Such cases would be highly dynamic prompts, like chatbots with little repetition, one-off requests, or real-time personalized systems.
. . .
Prompt caching can significantly improve the performance of AI applications in terms of both cost and time. In particular, when looking to scale AI apps, prompt caching comes in extremely handy for keeping cost and latency at acceptable levels.
For OpenAI’s API, prompt caching is activated by default, and the cost of non-cached input tokens is the same whether caching kicks in or not. Thus, one can only win by structuring prompts to hit the cache on every request, even when not every attempt succeeds.
Claude also provides extensive functionality on prompt caching through their API, which we are going to be exploring in detail in a future post.
Thanks for reading! 🙂
. . .
Loved this post? Let’s be friends! Join me on:
📰 Substack 💌 Medium 💼 LinkedIn ☕ Buy me a coffee!
All images by the author, except mentioned otherwise.