Integrating LLMs into Your Next.js App: Streaming, Architecture, and What Actually Matters
Most "AI integration" tutorials show you how to call an API and print the result. That's not integration, that's a fetch request. This guide covers the architectural decisions that actually matter when you're building LLM-powered features into a production application: how streaming works under the hood, when to use orchestration frameworks, and why you probably don't need to rewrite your backend in Python.
The Backend Question: Do You Actually Need Python?
Let's get this out of the way first, because it's the most common misconception.
If you're calling a hosted LLM API (OpenAI, Anthropic, Cohere, etc.), your backend language doesn't matter. You're making HTTP requests and handling streaming responses. The TypeScript/JavaScript SDKs for these providers are first-party, well-maintained, and fully featured. If your app is already built in Next.js or Node, there is zero reason to introduce a Python service just to proxy LLM calls.
When Python actually makes sense:
- Self-hosted models - If you're running Llama, Mistral, or a fine-tuned model locally, the ML ecosystem (PyTorch, vLLM, Hugging Face Transformers) is Python-native. You'll need a Python inference server.
- Heavy preprocessing pipelines - NLP tasks like custom tokenization, embedding generation with sentence-transformers, or document parsing with libraries like Unstructured genuinely have better Python tooling.
- Fine-tuning and training - This is exclusively Python territory.
- Data privacy requirements - If compliance means you can't send data to a third-party API and need to run models on your own infrastructure, you're in Python/CUDA land.
For everything else - calling APIs, managing conversation state, building chat UIs, implementing RAG with a vector database - TypeScript is fine. More than fine. You avoid the operational overhead of a polyglot architecture, a separate deployment pipeline, and cross-service communication just to make the same HTTP request you could make from Node.
How LLM Streaming Actually Works
When a user sends a message to an LLM, they don't want to stare at a spinner for 15 seconds. Streaming solves this by sending tokens as they're generated. But "streaming" is a vague term. Here's what's actually happening.
Server-Sent Events (SSE)
Most LLM providers use SSE for their streaming APIs. SSE is a standard built on top of HTTP - the server holds the connection open and pushes events to the client as plain text. It's unidirectional (server to client only), which is exactly what you need for token streaming.
The wire format is simple:
data: {"id":"chunk-1","choices":[{"delta":{"content":"Hello"}}]}
data: {"id":"chunk-2","choices":[{"delta":{"content":" world"}}]}
data: [DONE]
Each data: line is an event. The double newline separates events. The client reads them as they arrive.
Why SSE and not WebSockets? WebSockets are bidirectional and persistent, which is overkill for LLM streaming. You send one request, you get back a stream of tokens. SSE works over standard HTTP, is easier to deploy behind load balancers and CDNs, handles reconnection automatically, and doesn't require a persistent connection pool on your server. WebSockets make sense if you're building a real-time collaborative feature where both sides are constantly sending data. For LLM responses, SSE is the right tool.
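To make the framing concrete, here's a minimal sketch of parsing a raw SSE buffer into its data payloads. It's simplified (the real SSE spec also allows multi-line `data:` fields, event names, and comments), and it assumes the whole buffer is already in hand; the streaming client later in this guide shows how to handle partial chunks:

```typescript
// Minimal SSE parser: events are separated by a blank line, and each
// payload follows a "data: " prefix. Simplified relative to the full spec.
function parseSSE(raw: string): string[] {
  return raw
    .split("\n\n") // blank line separates events
    .map((event) => event.trim())
    .filter((event) => event.startsWith("data: "))
    .map((event) => event.slice("data: ".length));
}

const raw =
  'data: {"content":"Hello"}\n\n' +
  'data: {"content":" world"}\n\n' +
  "data: [DONE]\n\n";

const payloads = parseSSE(raw);
// payloads now holds two JSON chunks followed by the "[DONE]" sentinel
```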
Implementing Streaming in Next.js
Here's a proper implementation using Next.js API routes. This proxies the SSE stream from the provider through your server, keeping your API key secure:
// app/api/chat/route.ts
import { OpenAI } from "openai";
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
export async function POST(request: Request) {
const { messages } = await request.json();
const completion = await client.chat.completions.create({
model: "gpt-4o",
messages,
stream: true,
});
// Convert the SDK's async iterator into a ReadableStream
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
try {
for await (const chunk of completion) {
const content = chunk.choices[0]?.delta?.content ?? "";
if (content) {
// Re-emit as SSE format so the client can use EventSource or parse manually
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ content })}\n\n`),
);
}
}
controller.enqueue(encoder.encode("data: [DONE]\n\n"));
controller.close();
} catch (error) {
controller.error(error);
}
},
});
return new Response(stream, {
headers: {
"Content-Type": "text/event-stream",
"Cache-Control": "no-cache",
Connection: "keep-alive",
},
});
}
On the client side, you can consume this with the native fetch API and a reader, or with EventSource (note that EventSource only supports GET requests, so it won't work with a POST route like the one above without changes). Here's the fetch approach, which gives you more control:
async function streamChat(messages: Message[]) {
const response = await fetch("/api/chat", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ messages }),
});
if (!response.body) throw new Error("No response body");
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split("\n\n");
buffer = lines.pop() ?? ""; // Keep incomplete chunk in buffer
for (const line of lines) {
const data = line.replace(/^data: /, "");
if (data === "[DONE]") return;
const { content } = JSON.parse(data);
// Append content to your UI state
onToken(content);
}
}
}
Note the buffering logic - SSE events can arrive split across chunks. If you just decode and parse without buffering, you'll get JSON parse errors on partial messages.
Webhooks: The Async Alternative
SSE works great for real-time chat. But what if the LLM task takes minutes (long document analysis, complex agent workflows), or you need to trigger it from a background job?
Webhooks flip the model: instead of holding a connection open, you fire off the request and provide a callback URL. The LLM service hits your endpoint when the result is ready.
// Initiating an async LLM task
const job = await fetch("https://api.example.com/v1/completions", {
method: "POST",
body: JSON.stringify({
prompt: longDocument,
webhook_url: "https://yourapp.com/api/webhooks/llm-complete",
webhook_secret: process.env.WEBHOOK_SECRET,
}),
});
// Webhook receiver
// app/api/webhooks/llm-complete/route.ts
export async function POST(request: Request) {
const signature = request.headers.get("x-webhook-signature");
const body = await request.text();
if (!verifySignature(body, signature, process.env.WEBHOOK_SECRET)) {
return new Response("Unauthorized", { status: 401 });
}
const result = JSON.parse(body);
await db.completions.update({
where: { jobId: result.job_id },
data: { status: "complete", content: result.output },
});
// Notify the client via WebSocket, polling, or push notification
return new Response("OK");
}
This pattern is essential for batch processing (Anthropic's Message Batches API, OpenAI's Batch API), long-running agent tasks, or any workflow where tying up an HTTP connection for the duration isn't practical. The tradeoff is complexity - you need to manage job state, handle retries, and push updates to the client through a separate channel.
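The `verifySignature` helper above is doing HMAC verification. The exact scheme varies by provider (hex vs. base64 encoding, signed timestamps to prevent replay), so treat this as a sketch of the common shape - an HMAC-SHA256 of the raw body, compared in constant time:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sketch of HMAC-SHA256 webhook verification. Check your provider's docs
// for the exact signing scheme before relying on this shape.
function verifySignature(
  body: string,
  signature: string | null,
  secret: string | undefined,
): boolean {
  if (!signature || !secret) return false;
  const expected = createHmac("sha256", secret).update(body).digest("hex");
  const a = Buffer.from(expected, "utf8");
  const b = Buffer.from(signature, "utf8");
  // Length check first: timingSafeEqual throws on mismatched lengths.
  // Constant-time comparison avoids leaking the signature via timing.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

The constant-time comparison matters: a naive `===` lets an attacker recover a valid signature byte by byte from response timing.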
LangChain, Vercel AI SDK, and When You Need Them
Vercel AI SDK
If you're in Next.js and just need streaming chat, the Vercel AI SDK is the most ergonomic option. It handles the SSE plumbing, provides React hooks, and supports multiple providers:
// app/api/chat/route.ts
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";
export async function POST(req: Request) {
const { messages } = await req.json();
const result = streamText({
model: openai("gpt-4o"),
messages,
});
return result.toDataStreamResponse();
}
// components/Chat.tsx
"use client";
import { useChat } from "@ai-sdk/react";
export function Chat() {
const { messages, input, handleInputChange, handleSubmit } = useChat();
return (
<div>
{messages.map((m) => (
<div key={m.id}>
{m.role}: {m.content}
</div>
))}
<form onSubmit={handleSubmit}>
<input value={input} onChange={handleInputChange} />
</form>
</div>
);
}
That's a complete streaming chat implementation. The SDK handles buffering, error states, loading indicators, and abort signals. For straightforward chat interfaces, this saves you from writing the boilerplate covered earlier.
LangChain
LangChain (specifically LangChain.js) is a different beast. It's an orchestration framework for building complex LLM workflows - chaining multiple calls, managing memory, integrating tools, and building agents. The JavaScript version is fully featured and doesn't require Python.
Use LangChain when you need:
- Retrieval-Augmented Generation (RAG) - It has built-in abstractions for vector stores (Pinecone, Weaviate, pgvector), document loaders, text splitters, and retrieval chains.
- Tool/Function calling - Defining tools, parsing LLM tool-call responses, and executing them in a loop.
- Complex chains - Multi-step workflows where one LLM call feeds into another, with conditional logic.
- Agent frameworks - Autonomous agents that decide which tools to use based on the task.
Don't use LangChain when:
- You're building a simple chat interface. The abstraction overhead isn't worth it.
- You need fine-grained control over prompts and API parameters. LangChain's abstractions can obscure what's actually being sent to the model.
- You're making a single API call with a system prompt. Just use the provider SDK directly.
LangChain gets a lot of criticism for being over-abstracted, and some of it is fair. But for complex RAG pipelines or multi-tool agents, it saves you from reinventing a lot of plumbing. Evaluate whether your use case actually needs orchestration before adding the dependency.
Practical Concerns That Tutorials Skip
Cost Management
LLM API calls cost real money, and the costs scale with usage. A few things to implement early:
- Token counting before sending - Estimate request cost before making the call and set hard limits. tiktoken is one option for this (available as a JS package), or you can use the token counts returned in API responses to track usage after the fact.
- Caching - If users ask similar questions, cache responses. Even a simple hash-based cache on the prompt + model combination can cut costs significantly.
- Model routing - Not every request needs GPT-4o or Claude Opus. A task like auto-generating a chat title or extracting keywords from a message can go to a small, cheap model (GPT-4o-mini, Claude Haiku). Summarizing a long document or writing a detailed analysis? That's where you send the expensive model. Route based on the task, not a blanket default.
- Streaming abort - Let users cancel mid-stream. This stops token generation and saves money on abandoned requests.
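As a rough sketch of the first three ideas - using a characters-per-token heuristic in place of a real tokenizer, a keyword router with illustrative model names, and an in-memory cache (all of which you'd adapt to your stack):

```typescript
import { createHash } from "node:crypto";

// Rough token estimate: ~4 characters per token for English text.
// Swap in a real tokenizer (e.g. tiktoken) when accuracy matters.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

const MAX_INPUT_TOKENS = 8000; // hard limit, tune to your budget

function assertWithinBudget(prompt: string): void {
  const estimated = estimateTokens(prompt);
  if (estimated > MAX_INPUT_TOKENS) {
    throw new Error(`Prompt too large: ~${estimated} tokens`);
  }
}

// Route cheap tasks to a small model. Task names and models are illustrative.
function routeModel(task: "title" | "keywords" | "analysis" | "chat"): string {
  return task === "title" || task === "keywords" ? "gpt-4o-mini" : "gpt-4o";
}

// Simple hash-based cache keyed on prompt + model.
const responseCache = new Map<string, string>();
function cacheKey(model: string, prompt: string): string {
  return createHash("sha256").update(`${model}\n${prompt}`).digest("hex");
}
```

In production the cache would live in Redis or your database rather than process memory, but the keying strategy is the same.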
Rate Limiting Your Users (and Monetizing It)
LLM providers rate-limit you, and you need to rate-limit your users. A single abusive user can exhaust your API quota and affect everyone else. But the strategy you choose also shapes your product.
Option 1: Request-based limits - Simple: 20 requests per hour, tracked over a sliding window. Easy to implement, but crude. A one-sentence question and a 4000-token analysis cost you very different amounts.
Option 2: Token budgets - More granular. Give each user a token allowance per time period (e.g., 100k tokens/day) or a monthly total. Track input and output tokens from API responses and decrement their balance. This maps directly to your actual costs.
Option 3: Tiered plans with top-ups - This is where rate limiting becomes a feature instead of just a guardrail. Free users get a small monthly token budget. Paid users get more. And when anyone runs out, they can buy a token top-up instead of waiting for the next billing cycle. This turns your LLM cost center into a revenue line. We've done exactly this, and it works. Users who hit their limit and need more right now are happy to pay for it.
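Option 1 can be sketched as an in-memory sliding window. This is per-process only - a multi-instance deployment needs shared storage like Redis - but the logic is the same:

```typescript
// In-memory sliding-window limiter: allow `limit` requests per `windowMs`.
// Per-process only; production deployments need shared storage (e.g. Redis).
const requestLog = new Map<string, number[]>();

function allowRequest(
  userId: string,
  limit: number,
  windowMs: number,
  now = Date.now(),
): boolean {
  // Keep only timestamps still inside the window
  const timestamps = (requestLog.get(userId) ?? []).filter(
    (t) => now - t < windowMs,
  );
  if (timestamps.length >= limit) {
    requestLog.set(userId, timestamps);
    return false;
  }
  timestamps.push(now);
  requestLog.set(userId, timestamps);
  return true;
}
```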
Here's a simplified version of the token budget approach:
async function checkTokenBudget(userId: string, estimatedTokens: number) {
const user = await db.users.findUnique({ where: { id: userId } });
const remaining =
user.monthlyTokenLimit + user.topUpTokens - user.tokensUsedThisMonth;
if (remaining < estimatedTokens) {
return {
allowed: false,
remaining,
upgradeUrl: "/pricing",
topUpUrl: "/buy-tokens",
};
}
return { allowed: true, remaining };
}
// After a successful LLM call, deduct actual usage
async function recordUsage(
userId: string,
inputTokens: number,
outputTokens: number,
) {
await db.users.update({
where: { id: userId },
data: {
tokensUsedThisMonth: { increment: inputTokens + outputTokens },
},
});
}
Whichever approach you pick, make the limit visible to users. A usage bar in the UI is far better than a surprise 429 error.
Error Handling in Streams (The Gotcha Nobody Mentions)
Here's a problem you'll run into quickly: errors that show up as chat messages.
When you're streaming via SSE, the HTTP response has already started with a 200 status code. If something goes wrong mid-stream (the provider hits a content filter, your context window overflows, a network interruption kills the connection), the error gets written into the stream as plain text. Your frontend can't tell the difference between a token and an error message, so it renders "Rate limit exceeded" as if the AI said it.
This happens in a few ways:
- Provider errors mid-generation - The model starts responding, then hits a content filter or token limit. The stream terminates, sometimes with an error event, sometimes just by closing. Your UI shows a half-finished message.
- Your proxy writes the error as text - If your API route catches an error and returns it as a plain string, the client has no way to distinguish it from normal content.
- Network interruption - The connection drops. Your reader throws, and depending on your catch block you either get a truncated message or an error string appended to it.
The fix is to use structured SSE events so the client can tell content apart from errors:
// Server: send typed events
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ type: "token", content })}\n\n`),
);
// When something goes wrong mid-stream:
controller.enqueue(
encoder.encode(
`data: ${JSON.stringify({ type: "error", code: "rate_limit", message: "Too many requests" })}\n\n`,
),
);
// Client: handle by type
const event = JSON.parse(data);
if (event.type === "error") {
showErrorState(event.code, event.message); // render as a UI notification, not chat content
} else if (event.type === "token") {
appendToMessage(event.content);
}
Streaming UX That Doesn't Frustrate Users
Error handling isn't just a backend problem. How your UI responds to failures and interruptions matters just as much.
Let users abort mid-stream. If the model starts heading in the wrong direction, users need to be able to cancel immediately and adjust their prompt. An escape key binding or a visible stop button should abort the stream using an AbortController, and the partial response should remain visible so the user can see what went wrong and course-correct.
const controller = new AbortController();
const response = await fetch("/api/chat", {
method: "POST",
body: JSON.stringify({ messages }),
signal: controller.signal,
});
// Wire this to a stop button or Escape key
function handleCancel() {
controller.abort();
markMessageAsPartial(); // keep the partial content visible
}
Don't show abrupt error messages. A red banner that says "Something went wrong" with no context is hostile. If a stream fails partway through, keep the partial response visible and show a gentle inline prompt: "This response was interrupted. Retry?" with a single click to resend. If the error is a rate limit, tell them when they can try again.
Offer recovery, not dead ends. The best pattern we've found is: keep the partial or failed message in the chat history, show a contextual retry button on that specific message, and let the user edit their original prompt before retrying. This turns an error from a wall into a speed bump. Users don't lose their train of thought, and they stay in control of the conversation.
Context Window Management
Every model has a context window limit. As conversations grow, you need a strategy, and each one has real tradeoffs.
Sliding window is the simplest approach: drop the oldest messages when you're approaching the limit. But simple doesn't mean safe. If the user established something important early in the conversation ("my budget is $50k", "I'm allergic to shellfish", "the deployment target is ARM"), that context vanishes silently and the model starts contradicting earlier responses. Users notice.
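A minimal sketch of the sliding-window approach, pinning the system message so it never falls out of the window (same characters-per-token heuristic assumed; use a real tokenizer in practice):

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Drop oldest non-system messages until the estimated total fits the budget.
// Uses a rough ~4 chars/token heuristic; swap in a real tokenizer for accuracy.
function trimToWindow(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const estimate = (m: ChatMessage) => Math.ceil(m.content.length / 4);
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  let total = messages.reduce((sum, m) => sum + estimate(m), 0);
  while (rest.length > 1 && total > maxTokens) {
    total -= estimate(rest.shift()!); // drop the oldest turn
  }
  return [...system, ...rest];
}
```

Note what this code silently does: the `$50k` budget mentioned in message two is gone once message two falls out of the window. That's the failure mode described above, in fifteen lines.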
Summarization/compaction is the next thing people reach for: periodically summarize the conversation and replace old messages with the summary. This is better, but context compaction loses information. The summary is only as good as the model writing it, and it will drop details that seemed unimportant at summarization time but turn out to matter later. You can mitigate this by keeping a running memory file alongside the summary, a structured document that accumulates key facts, decisions, and user preferences as the conversation progresses. But you have to be careful how this memory is written (what gets included, how it's organized) and how it's consumed (injected as system context? appended to the prompt?). A sloppy memory file becomes noise that wastes tokens without helping.
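A sketch of the compaction step. The summarizer is injected as a function - in practice it would be an LLM call through your provider SDK - so the shape here is an assumption, not a standard API:

```typescript
type Msg = { role: "system" | "user" | "assistant"; content: string };

// Replace all but the last `keepRecent` messages with a single summary message.
// `summarize` would typically be an LLM call; it's injected here so the
// compaction logic stays testable.
async function compactHistory(
  messages: Msg[],
  keepRecent: number,
  summarize: (transcript: string) => Promise<string>,
): Promise<Msg[]> {
  if (messages.length <= keepRecent) return messages;
  const old = messages.slice(0, messages.length - keepRecent);
  const recent = messages.slice(messages.length - keepRecent);
  const transcript = old.map((m) => `${m.role}: ${m.content}`).join("\n");
  const summary = await summarize(transcript);
  return [
    { role: "system", content: `Summary of earlier conversation:\n${summary}` },
    ...recent,
  ];
}
```

The quality of the result hinges entirely on the summarization prompt - which details the model is told to preserve (facts, decisions, user preferences) versus compress away.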
RAG on conversation history is the most robust approach for long-running conversations. Embed messages into a vector store as they come in, and retrieve relevant ones when constructing each new prompt. Pair this with an organized memory file (key facts, user preferences, decisions made) and you get the best of both worlds: the model always has access to the important context without needing to carry the entire history. This is more infrastructure to set up (you need an embedding model, a vector store, and retrieval logic), but for any app where conversations regularly exceed the context window, it's worth it.
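To show the retrieval structure without the infrastructure, here's a toy version: the word-overlap "similarity" below is a stand-in for real embeddings and a vector store (pgvector, Pinecone, etc.), but the shape - store each message as it arrives, retrieve the top-k relevant ones when building each prompt - is the part that carries over:

```typescript
// Toy retrieval over conversation history. Word-overlap similarity is a
// placeholder for embedding-based search; only the structure carries over.
type StoredMessage = { role: string; content: string };

const messageStore: StoredMessage[] = [];

function remember(message: StoredMessage): void {
  messageStore.push(message);
}

function similarity(a: string, b: string): number {
  const wordsA = new Set(a.toLowerCase().split(/\W+/).filter(Boolean));
  const wordsB = new Set(b.toLowerCase().split(/\W+/).filter(Boolean));
  let overlap = 0;
  for (const w of wordsA) if (wordsB.has(w)) overlap++;
  return overlap / Math.max(1, Math.min(wordsA.size, wordsB.size));
}

function retrieveRelevant(query: string, k: number): StoredMessage[] {
  return [...messageStore]
    .sort((x, y) => similarity(query, y.content) - similarity(query, x.content))
    .slice(0, k);
}
```

With real embeddings, `remember` becomes an embed-and-upsert call and `retrieveRelevant` becomes a vector similarity query, but the calling code stays the same.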
Where to Go From Here
The foundation is straightforward: proxy LLM calls through your server, stream responses via SSE, and handle errors properly. The real complexity comes from what you build on top - RAG pipelines, agent workflows, multi-model routing, and cost optimization.
Start with the simplest thing that works (direct API calls with the Vercel AI SDK), and add complexity only when you have a concrete reason. The most common mistake is over-engineering the AI layer before you've validated that users actually want the feature.