
You're Not Getting the Most Out of LLMs

LLMs are good at some things and bad at others. Getting more out of them comes down to matching the tool to the task and running the model in a setup that can actually do work, not just a chat window that talks.

Graham Morley · 6/23/2026 · 8 min read

A lot of people use AI now, but I think many of them are getting only a fraction of what these tools can do. There are a few key things to know when you set up your LLM environment.

A language model is good at some things and genuinely bad at others. Point it at the work it's bad at and it will still give you an answer, which is exactly why the mistake is easy to miss. Getting more out of these tools comes down to two habits: knowing which jobs to hand the model and which to hand a script it writes for you, and putting the model in a setup that can actually do the work instead of only talking about it.

TL;DR: An LLM is strong at reading, drafting, and judgment, and weak at work that has to be exact and identical every time. Hand the exact work to a script the model writes for you, and run the model in a harness (Claude Code, Codex, Cowork) that can act on your computer, keep context across sessions, and load your own skills. A plain chat window can only do the talking half.

Chat window, agent, or harness

It helps to know what you're actually using, because these tools come in three shapes that can do very different things.

A chat window is what comes to mind first: ChatGPT, Claude, Gemini. You type, it replies. A modern one can browse the public web, read files you upload, and even run code or build a small artifact in a sandbox. That sandbox is walled off from your computer, though. It can't open your real files, log into your browser sessions, or reach the systems your business runs on. It's strong for thinking, drafting, and answering, but the doing happens in a box you can't connect to your actual work.

An agent is the model put in a loop where it can take actions and react to what happens. Instead of answering once, it can run a command, read the result, write a file, open a browser, hit an error, and try again, working toward a goal across many steps. It's the difference between being told how to do something and having it done.

A harness is the software around the model that makes the agent possible. It manages the context window, gives the model its tools, runs the loop, and loads your instructions, your skills, and your connections to outside systems. Claude Code, Codex, and Claude Cowork are harnesses.

The short version: a chat window talks, an agent acts, and a harness is what lets it act in your environment instead of a sandbox. For real work you want the second and third. I mapped the specific tools in each category in "What AI Platforms Do for You."

Hand the model the right half of the job

A language model reads for meaning. That makes it good at judgment, summarizing, drafting, and deciding in a fuzzy situation. It also makes it the wrong tool for work that has to be exact and identical every time.

Here's the example that made me want to write this down. Someone I know was using an LLM to check whether a customer had altered a contract before sending it back. They would paste in the original and the returned version and ask the model what changed.

That's the wrong tool for the job. An LLM reads for meaning, not for exact characters, so it can miss a changed number, smooth over a deleted clause, or report a difference that isn't there. For "did these two documents change, and where," you want an exact comparison, not a reading.

The better version splits the job in two. Ask the model to write a small script that does a precise text comparison of the two files and outputs a clean report of every difference. Then hand that report back to the model and ask it to explain what changed and whether any of it matters. The script does the part that has to be exact. The model does the part that needs judgment.
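Here's roughly what that small script could look like. This is a minimal sketch, not the script from the story: it assumes both versions of the contract are already plain text, and the name `diff_report` is made up for illustration. It uses Python's standard `difflib` module to compare the two texts line by line and list every line that differs.

```python
import difflib


def diff_report(original: str, returned: str) -> list[str]:
    """Return one entry per differing line between two texts (hypothetical helper)."""
    a = original.splitlines()
    b = returned.splitlines()
    # autojunk=False turns off the "popular line" heuristic so nothing is skipped.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    report = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical runs of lines are not reported
        for n in range(i1, i2):
            report.append(f"original line {n + 1} removed or changed: {a[n]!r}")
        for n in range(j1, j2):
            report.append(f"returned line {n + 1} added or changed: {b[n]!r}")
    return report


if __name__ == "__main__":
    # Made-up sample text standing in for the two contract versions.
    sent = "Payment is due in 30 days.\nLate fee: 2%.\nGoverning law: Ohio."
    received = "Payment is due in 90 days.\nGoverning law: Ohio."
    for entry in diff_report(sent, received):
        print(entry)
```

The same input always produces the same report, which is the property the model can't promise. The report is short enough to paste back to the model for the judgment half.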

This is where a harness earns its place, because it can do both halves itself. It writes the script, runs it, reads the output, and reasons about it, without you moving between a chat window and a code editor. A chat window can only do the reading half.

The habit worth building: when a task needs the same input to always produce the same correct output, that's work for a script, and a few lines of code will beat the model on accuracy, speed, and cost. When a task needs reading or judgment, that's the model's half. A lot of jobs are a mix, and the skill is telling the two apart.

Stop re-explaining yourself

A model starts every conversation knowing nothing about you. By default you re-explain your context, your preferences, and your standards every time, which is slow and easy to get wrong. (For why the model has no memory of its own, I wrote a whole piece on how AI memory actually works.)

The first fix is to write your context down once, where the tool reads it automatically. Claude Code, Codex, and Cowork all support persistent instructions: a file the model loads at the start of every session so it already knows how you work. In Claude Code that file is CLAUDE.md, and in Codex it's AGENTS.md.

Let it work on your own computer

For more involved work I go further and give the model real filesystem and computer access, which is what a harness provides. Here's what that actually buys you.

You can keep several windows open at once, all reading the same files, so they share one set of context instead of each starting blind. They can write to those files too, including leaving each other notes in plain .md files. (An .md file is just a text file with light formatting that both you and the model can read and edit.) One window does a piece of work and writes down what it found, and another picks it up from there.
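To make the shared-notes idea concrete, here is a small sketch of the kind of helper a session might write for itself. Everything in it is an assumption for illustration: the function names `leave_note` and `read_notes`, the heading layout, and the idea of one notes file per project. In practice the model usually just edits the .md file directly.

```python
from datetime import datetime, timezone
from pathlib import Path


def leave_note(notes_file: Path, session: str, text: str) -> None:
    """Append one timestamped note to a shared Markdown file (hypothetical helper)."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    # Append mode means earlier notes from other sessions are never overwritten.
    with notes_file.open("a", encoding="utf-8") as f:
        f.write(f"## {session} ({stamp})\n\n{text.strip()}\n\n")


def read_notes(notes_file: Path) -> str:
    """Return everything written so far, or an empty string if no notes exist yet."""
    if not notes_file.exists():
        return ""
    return notes_file.read_text(encoding="utf-8")
```

One window calls `leave_note` when it finishes a piece of work, and the next one calls `read_notes` before it starts, so nothing depends on either window remembering the other.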

The model can also run scripts directly on your computer instead of grinding through a task by hand. That's faster and cheaper, since a script does in seconds what would otherwise burn through minutes of the model's time and your usage.

And the cost model shifts. Harnesses like Claude Code, Cowork, and Codex can run against your subscription rather than billing per token, so within your plan's usage limits, running things locally doesn't meter you task by task. That makes a lot of small automation worth doing: a job that runs every morning, a Python script that tidies a folder, generating a PDF, connecting to your email or calendar through an MCP server, a daily digest of whatever you need to see. The result is local agents quietly doing work on your machine at a flat cost. There's more you can do with it than I can fit here, but that's the shape of it.
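As one example of that kind of small automation, here is a sketch of a folder-tidying script. It's an illustration under my own assumptions, not a recommended layout: the name `tidy_folder` and the rule "one subfolder per file extension" are invented for the example, and it deliberately skips anything it would have to overwrite.

```python
import shutil
from pathlib import Path


def tidy_folder(folder: Path) -> dict[str, str]:
    """Move each loose file into a subfolder named after its extension.

    Returns a mapping of file name -> subfolder it was moved into.
    """
    moved = {}
    # sorted() takes a snapshot first, so folders created below are not revisited.
    for item in sorted(folder.iterdir()):
        if not item.is_file() or item.name.startswith("."):
            continue  # leave subfolders and hidden files alone
        target_dir = folder / (item.suffix.lstrip(".").lower() or "no_extension")
        if target_dir.exists() and not target_dir.is_dir():
            continue  # a file already uses that name, so skip rather than fail
        target_dir.mkdir(exist_ok=True)
        destination = target_dir / item.name
        if destination.exists():
            continue  # never overwrite an existing file
        shutil.move(str(item), str(destination))
        moved[item.name] = target_dir.name
    return moved
```

Run against a downloads folder, for instance `tidy_folder(Path.home() / "Downloads")`, it sorts PDFs into `pdf`, spreadsheets into `xlsx`, and so on. Scheduling it to run every morning is a separate step that depends on your operating system.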

Make it yours: skills and plugins

You can go further still, into skills and plugins. A skill is a written procedure for a specific job, the steps and the standards, that the tool pulls in when that job comes up, so it does the task your way instead of a generic version. A plugin bundles a set of those skills together. This is the part the platforms barely mention, and it's where a generic tool turns into one that's yours.

A real example. I built a patent lawyer a plugin of skills for parts of their patent work. Off the shelf, the tool gave generic answers. With skills written for how that practice actually operates, it does the specific work the way the firm needs it, without being re-instructed each time. That's the difference between a clever assistant and a tool fitted to your job.

Setting this up is fiddly, and it's the part people skip, which is part of why the tool out of the box can feel like a novelty while the same tool configured for your work feels like leverage.

The setup is the point

None of this requires you to become technical. It requires knowing what to hand the model, what to hand a script, and what to write down so you stop repeating yourself. Get those right and the same tools you already use start returning more than they do for someone typing questions into a chat box.

Want these tools set up for the work you actually do? I'll look at where they fit, build the skills that make them useful, or tell you where a simpler tool or none at all is the better call. We help teams get that part right.

About the author

Graham Morley, Founder, Morley Media Group

Graham has been shipping production software since 2011, including SOC 2 and ISO 27001 certified platforms, DeFi protocols managing millions, and AI products that raised venture funding. He builds, advises, and leads engineering for companies at every stage, from a clean website to a complex AI platform.
