Every day, millions of people open ChatGPT or Claude and pick up where they left off. They say things like "remember that project we were working on?" or "based on what you know about me, what do you think?" It feels like a conversation with someone who knows you, but it isn't.
Large language models have no memory. None. Every single time you send a message, the model receives a fresh prompt, processes it, and returns a response. It has never seen you before and it will not remember you after. The continuity you experience is an illusion, and understanding how that illusion works will make you a dramatically better user of these tools.
TL;DR - The people who get the most from AI tools are the ones who stop treating them like a colleague with a memory and start treating them like a brilliant stateless function. Talk to them conversationally, but know that when something matters, you should restate it rather than assume it carried over.
The cold start problem
Here is what actually happens when you send a message in a chat interface like Claude or ChatGPT.
The application takes your new message, prepends the entire conversation history above it, and sends the whole bundle to the model as a single request. The model reads everything, top to bottom, as if it were encountering the entire conversation for the first time. It generates a response and then the application appends that response to the history. When you send your next message, the whole process repeats with a slightly longer bundle.
This is not a technical detail, it is the entire architecture. There is no hidden state being maintained between calls. There is no persistent thread running inside the model that tracks who you are or what you said ten minutes ago. The model is a function: text goes in, text comes out, and between calls it retains nothing.
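The whole loop fits in a few lines. This sketch uses a hypothetical `call_model` stub standing in for any real LLM API; the point is that the only "state" anywhere is a plain list that the application resends in full on every turn:

```python
def call_model(messages):
    """Stand-in for an LLM API call. The model sees only what is
    in `messages` and keeps no state between calls."""
    # A real implementation would hit an API here; this stub just
    # reports how much context it was handed.
    return f"(reply after reading {len(messages)} messages)"

def chat_turn(history, user_message):
    """One round trip: the ENTIRE history is resent every time."""
    history.append({"role": "user", "content": user_message})
    reply = call_model(history)  # the model reads everything, cold
    history.append({"role": "assistant", "content": reply})
    return reply

history = []
chat_turn(history, "Hello")         # model reads 1 message
chat_turn(history, "Remember me?")  # model reads 3 messages
```

Notice that "memory" lives entirely in the `history` list owned by the application. Delete that list and the model has never met you.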
The reason this matters is that it changes how you should think about conversations with AI. A long chat session is not a deepening relationship with an increasingly informed assistant, it is a growing document that gets re-read from scratch every time you hit send.
The context window: your conversation has a size limit
That growing document has a hard ceiling. Every model has a context window, measured in tokens (roughly three-quarters of a word per token), and once the conversation exceeds it, something has to give. You should also be aware that the longer a conversation gets, the more tokens you consume each time you hit send.
Current context windows range from 128K to over a million tokens depending on the model and provider. If you have ever worked through a complex coding session, a long research thread, or an iterative document revision, you may have noticed the model starting to "forget" things you discussed earlier. It didn't forget, per se; rather, the application quietly dropped or compressed older messages to make room for newer ones.
This is called compaction, and in most cases it happens silently (Claude Code is an exception; it tells you when it's happening). Usually the platform summarizes your earlier messages into a condensed version, but less sophisticated platforms simply drop the oldest messages. Either way, you lose fidelity. The model is no longer reading your actual words from earlier in the conversation; it is reading a compressed approximation of them, or not seeing them at all.
This is why long conversations tend to drift. The model is literally working with an incomplete version of the conversation you think you're having.
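A minimal sketch of the naive drop-oldest strategy makes the fidelity loss concrete. Real platforms usually summarize rather than delete, and the token count here is a crude word-based approximation, but the effect on the model's view of the conversation is the same:

```python
def rough_tokens(text):
    # Crude approximation: roughly 4/3 tokens per word.
    return int(len(text.split()) * 4 / 3)

def compact(history, budget):
    """Naive compaction: drop the oldest messages until the
    conversation fits the token budget. The caller is never told
    which messages vanished."""
    kept = list(history)
    while kept and sum(rough_tokens(m["content"]) for m in kept) > budget:
        kept.pop(0)  # the oldest message silently disappears
    return kept

history = [
    {"content": "word " * 100},  # early context: first to go
    {"content": "word " * 100},
    {"content": "word " * 20},
]
trimmed = compact(history, budget=180)  # the first message is dropped
```

The model only ever sees `trimmed`. Whatever was in that first message, however important, is simply gone from its view.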
How LLM platforms simulate memory
Platform-level memory features (Claude's memory system, ChatGPT's memory) create the strongest version of the illusion of memory. You tell the model your name, your job, your preferences, and the next day it greets you with all of that context. This isn't memory the way we think of it, and if you anthropomorphize the LLM, you risk misunderstanding the tool you're working with.
What is actually happening: the platform extracted facts from your previous conversations, stored them in a database, and injected them into the system prompt at the start of your new session. The model is reading its own notes. It is doing the equivalent of a doctor glancing at your chart before walking into the exam room. The knowledge is real, but it is external to the model, retrieved and inserted by the application layer.
This is an important distinction because it defines the limits of what these memory systems can do. They store discrete facts, not conversational nuance. They do not know the arc of reasoning you walked through last Tuesday to arrive at a specific decision. The texture is gone. Only the labels survive.
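Stripped to its essence, the mechanism is string concatenation. Everything below (the stored facts, the function name, the prompt format) is illustrative, not any platform's actual implementation:

```python
# Hypothetical "memories": facts extracted from past chats and
# kept in a database, entirely outside the model.
user_facts = {
    "name": "Alex",
    "role": "data engineer",
    "preference": "concise answers",
}

def build_system_prompt(facts):
    """Inject stored facts at the top of a fresh session. The model
    'remembers' you only because this text is prepended."""
    lines = [f"- {key}: {value}" for key, value in facts.items()]
    return "Known facts about the user:\n" + "\n".join(lines)

prompt = build_system_prompt(user_facts)
```

Note what survives this round trip: the labels. The conversation where the platform learned you were a data engineer, and everything you said around it, is not in `user_facts`.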
The four flavors of fake memory
Every "memory" system in the current AI landscape is a variation on the same pattern: store information outside the model, retrieve it at call time, and inject it into the prompt. The differences come down to retrieval strategy and transparency.
RAG with vector databases is the most scalable approach. Prior conversations or documents are chunked into pieces, converted into numerical representations (embeddings), and stored in a vector database. When you send a new message, the system searches that database for semantically relevant chunks and injects them into the prompt alongside your message. This is powerful, but it is also lossy. You are at the mercy of a similarity search deciding what is "relevant" to your current query. Context that a human would consider obviously important might not score highly enough to make the cut.
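A toy sketch of the retrieval step shows where the lossiness comes from. The vectors here are hand-made stand-ins; a real system would produce them with an embedding model and store them in a proper vector database:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-in embeddings for stored conversation chunks.
chunks = [
    ("We chose Postgres for the analytics DB", [0.90, 0.10, 0.00]),
    ("Team standup moved to 10am",             [0.00, 0.80, 0.20]),
    ("Use connection pooling for Postgres",    [0.85, 0.05, 0.10]),
]

def retrieve(query_vec, k=2):
    """Return the k chunks most similar to the query embedding.
    Anything below the cut is invisible to the model."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

top = retrieve([0.90, 0.10, 0.05])  # a "database question" embedding
```

For this query, the two Postgres chunks win and the standup note never reaches the prompt, which is exactly the behavior you want here and exactly the failure mode when the similarity scores rank things differently than a human would.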
Structured database retrieval is more deterministic. Instead of relying on semantic similarity, you define explicit schemas for what gets stored (user preferences, project metadata, conversation summaries) and retrieve it based on rules or direct lookups. This gives you more control, but it also means you are hand-engineering what the system remembers. In practice, it works reasonably well for structured facts and less well for open-ended conversational context.
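A sketch of the structured approach, using an in-memory SQLite table; the schema and the stored facts are invented for illustration:

```python
import sqlite3

# Explicit schema: you decide up front what is worth remembering.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE memory (kind TEXT, key TEXT, value TEXT)")
db.executemany(
    "INSERT INTO memory VALUES (?, ?, ?)",
    [("preference", "tone", "direct"),
     ("project", "name", "billing-migration"),
     ("project", "language", "Go")],
)

def recall(kind):
    """Deterministic lookup: no similarity search, no surprises.
    But only facts that fit the schema ever get stored."""
    rows = db.execute(
        "SELECT key, value FROM memory WHERE kind = ?", (kind,)
    ).fetchall()
    return dict(rows)

project = recall("project")  # {"name": "billing-migration", "language": "Go"}
```

The trade-off is visible in the schema itself: anything that doesn't fit a `(kind, key, value)` row, like the reasoning behind a decision, has nowhere to live.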
Platform memory systems are what Claude and ChatGPT offer to end users. These are essentially a managed version of the approaches above, abstracted behind a clean interface. The platform extracts facts, stores them, and handles retrieval automatically. You do not see the mechanism, which is both the appeal and the risk. You have limited visibility into what was stored, what was missed, and what might be wrong.
Markdown files are the most transparent option. Tools like Claude Code use a CLAUDE.md file that the model reads at the start of every session. There is no magic here. It is a text file. You can open it, read it, edit it, and know exactly what the model will "know" about your project. This lack of sophistication is actually a feature, because you can see and control the seams.
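The whole mechanism fits in a few lines. `PROJECT.md` below is a hypothetical stand-in for a file like CLAUDE.md:

```python
from pathlib import Path

# A plain-text context file you can read and edit yourself.
Path("PROJECT.md").write_text(
    "# Project notes\n"
    "- API lives in src/api/\n"
    "- Run tests with `make test`\n"
)

def build_prompt(user_message):
    """Prepend the context file to every session. That is the
    whole mechanism: no extraction, no retrieval, no magic."""
    context = Path("PROJECT.md").read_text()
    return context + "\n---\n" + user_message

prompt = build_prompt("Add an endpoint for refunds.")
```

Because the "memory" is a file you own, debugging it is trivial: if the model doesn't know something, it isn't in the file; if it believes something wrong, you edit the file.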
All four approaches converge on the same architectural truth: something outside the model is doing the remembering, and the model is reading it cold every time.
What about models that learn from you?
Cursor, the AI coding editor built by Anysphere, has introduced something genuinely novel with what they call "real-time reinforcement learning." They collect behavioral signals from their user base (whether edits were accepted, whether users sent frustrated follow-ups, whether tool calls broke) and use those signals to retrain their model on a roughly five-hour cycle. The model you use in the afternoon is a different checkpoint than the one you used in the morning.
This is interesting, and it is worth understanding precisely because of what it is not. This is not per-user, per-session learning. The model is not adapting to you during your conversation, it is adapting to the statistical aggregate of all Cursor users and deploying updated weights. Your individual session is still completely stateless, but you are benefiting from a faster version of the traditional train-and-deploy cycle.
Neither Anthropic nor OpenAI has publicly described anything this tight for their general-purpose models. Part of the reason is scale: retraining a frontier model multiple times per day on production traffic is far more tractable when the model is smaller and domain-specific. But it points toward a future where the line between "the model was trained" and "the model is training" gets blurry, even if the individual inference call remains stateless.
What this means for how you use AI
Understanding this architecture changes your behavior in concrete ways.
Start new conversations more often. A fresh conversation is not a loss, it is a clean context window with no compaction artifacts, no summarized-away nuance, and no accumulated confusion. If your thread has gone past 30 or 40 exchanges, you are almost certainly working with a degraded version of your own conversation.
Front-load context. Because every message is a cold start with the conversation history prepended, the information at the top of your chat matters disproportionately. If you are starting a complex task, write a clear, detailed opening message. Do not drip-feed context over a dozen turns and assume the model is building a mental model the way a human collaborator would. It is rereading everything from scratch each time, and if the early messages get compacted, your carefully layered context evaporates.
Do not trust long-session continuity for critical work. If you are making important decisions (architecture choices, legal analysis, financial modeling), do not rely on the model's ability to hold the full context of a session that has gone on for hours. Re-state your constraints. Re-paste your requirements. Redundancy is not waste here, it is insurance against silent context loss.
Treat memory features as a convenience, not a guarantee. Platform memory is useful for casual personalization, but it is not a reliable system of record. If something matters, do not assume the platform stored it correctly, write it down yourself.
Use explicit context files when possible. If your tool supports something like CLAUDE.md or project-level instructions, use them. They are the most reliable and transparent form of "memory" available, precisely because they are not memory at all. They are documentation, and documentation is something engineers already know how to manage.
The honest version
There is nothing wrong with the illusion of memory; it makes these tools more pleasant and more useful. Still, illusions become dangerous when you mistake them for reality and make decisions based on assumptions about what the model knows, what it is tracking, and what it will retain.
The model is not your colleague. It is not building an understanding of your project over time. It is a stateless function that reads a document and generates a continuation. Everything that makes it feel like more than that is happening in the application layer, outside the model, in systems that are useful but imperfect.
Once you understand that, you stop being frustrated when the model "forgets" something. You stop having long, winding conversations when a concise, well-structured prompt would serve you better. You start treating context as a finite resource and managing it deliberately.
The best AI users are not the ones who have the longest conversations. They are the ones who understand the architecture well enough to work with it instead of against it.
One final note
If you aren't familiar with metaprompts, they may help you use AI better. A metaprompt is a structured prompt generated by one AI context window for use in another. For example, you might describe a project to Claude Chat, which has some memory context from your previous conversations, and ask it to generate a detailed prompt that you then pass into a tool like Claude Code, which knows nothing about you or your project beyond what that prompt contains. You review and edit the metaprompt before sending it, because you are the bridge between two stateless systems that have no awareness of each other. It takes a couple of minutes and it eliminates an entire category of "why doesn't the AI understand what I'm working on" frustration. We have a full guide on writing effective metaprompts, and if you've read this far, it's the natural next step.
