Lessons from building full-stack LLM agents
Building full-stack agents represents an exciting frontier in AI, unlocking new possibilities for automating complex tasks at scale. Drawing on my experience developing LLM-powered tools for healthcare at Anterior, this post dives into key considerations in designing systems that solve tasks using Large Language Models. In future posts, I hope to also share insights on scaling, managing rate limits, and mitigating hallucinations, all essential to building reliable and efficient AI systems.
The Anthropic blog post on building effective agents inspired me to put down my thoughts from my experience of working in this space for the last two years.
Workflows vs Agents: Defining the boundaries
The Anthropic article emphasizes the importance of simplicity, advising developers to start with workflows when they provide a more straightforward solution than agents.
There is a lot of overlap between a "Workflow" and an "Agent", for sure. Both leverage LLMs in routing tasks, reasoning tasks, filtering tasks and even self-evaluation. However, the key distinction and boundary I like to draw between the two can be summarised as:
- Workflow - when the steps required to generate an outcome are discrete, pre-defined, and often known ahead of time.
- Agent - when the steps required to generate an outcome may need to be "calculated" or determined at run-time, and the outcome is unknown. The system maintains control over how tasks are accomplished.
In my mental model for distinguishing between Workflows and Agents, I try to assess where the bulk of the critical "thinking" needs to be done in the system. If it is baked into the code, i.e. there is a set of rules to follow, then it's a Workflow. If the system is designed to outsource the critical thinking to the LLM, then it can be classified as an Agent.
See the diagram below for an Agent, where all Instructions are abstracted under "LLM call" which handles both the executed Action, and the incoming Feedback from the environment:
Most existing business processes fall under workflows where systems and people have already been put in place working through pre-defined steps to deliver an outcome. Agents are often seen in the "joints" of workflows where the input, output or next steps are ambiguous.
One way to surface the differences between Workflows and Agents is to discuss the approaches to building them, drawing on real-life examples and assessing the cost-benefit of building the solution as a Workflow versus an Agent.
Approach #1: Retrieval Augmented Generation (RAG)
RAG is a component, or building-block, that is often reached for by developers building LLM-native applications and solutions. Current-generation LLMs are good at using this capability to enhance their abilities through the use of retrieval (vector databases), tools and memory.
Whilst there are many nuances to building RAG, the concept is actually very simple.
Give your LLM access to a vector database, a function that it can call, or a datastore for memory -- and then prompt it. After augmenting the LLM with these additional capabilities you'll likely see an improvement in performance and an expansion in the types of problems it can solve for you.
In my experience, RAG has been a useful tool when doing retrieval over a large amount of text data such as complex medical records. These are typically 100+ pages of notes on complex medical conditions, procedures and patient journeys.
Extracting the information from a medical record, breaking it into chunks and storing it in a Vector Database gives the LLM access to a short-term memory store over which it can conduct more refined retrieval.
I've found this to be an effective way to separate the responsibility concerns of data extraction, sanitization and retrieval from the actual medical reasoning logic carried out by the LLM. It also allows for a more scientific approach to retrieval, where the parameters of a cosine similarity search can be tweaked to control tolerance in the system.
Cosine similarity search is only one way of controlling retrieval performance. Working with Vector Databases is an art and science of itself, and deserves its own series of posts.
It's also worth noting that by moving retrieval out of the LLM's embedded context window, you free up your prompt from dilution, save on the cost of input tokens and evade the challenges of lost-in-the-middle.
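To make the shape of this concrete, here's a minimal sketch of the retrieval step. The `embed()` and `call_llm()` helpers are hypothetical placeholders for whichever embedding model and chat model your stack uses, and the chunk size and similarity threshold are illustrative values, not the ones we use at Anterior:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: swap in your embedding model / Vector Database client."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your chat-model client."""
    raise NotImplementedError

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split a long record into overlapping character chunks."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(question: str, chunks: list[str], top_k: int = 5, min_score: float = 0.75) -> list[str]:
    """Return the top_k chunks above a tunable similarity threshold."""
    q_vec = embed(question)
    scored = [(cosine_similarity(q_vec, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:top_k] if score >= min_score]

def answer(question: str, record_text: str) -> str:
    context = "\n---\n".join(retrieve(question, chunk_text(record_text)))
    return call_llm(f"Context:\n{context}\n\nQuestion: {question}")
```

In practice you would persist the embeddings in a Vector Database rather than recompute them per question; `min_score` is the "tolerance" knob mentioned above.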
Whilst I haven't written about the following topics, it is worth doing your own research to get a more complete understanding on:
- The cost of missing context when chunking data
- The implications of moving full context out of the LLM context window, and into retrieval such as a Vector Database
- Fine-tuning embedding models for better performance and control over dimensions critical to your system
- Storing sensitive information in a vector database
Approach #2: Prompt Chaining
When we built v0 of the Clinical brain at Anterior, our approach was one largely based on vibes. In the very first few days of building, we didn't have a full appreciation for the limitations of models, nor what running out of "context" meant.
So, we took entire medical records and put them into the prompt. We appended the request with a question and instruction, and naturally, were blown away by the results. (It's hard to believe how magical LLMs felt back then!)
"Damn, that was easy..."
So we piled the questions and instructions into the prompts until we began to observe more and more hallucinations and instances where the model would fail to respond consistently. Consistency is important when building for healthcare.
In hindsight, the lesson is obvious. We were prompting the LLM to complete a discrete task, so we should have been decomposing the request into fixed subtasks. Under traditional software-engineering principles, this would be writing a function to complete one piece of work. I guess the magical nature of LLMs encouraged us to throw everything including the kitchen sink into the prompt.
So, in an attempt to address the hallucinations, we broke the single big task into multiple smaller requests, each with a more refined prompt. In order to address the consistency gap, we introduced "gates" to ensure that between each subtask, the process as a whole was still on track.
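As a rough illustration, here's what a two-step chain with a gate between subtasks might look like. The subtask prompts and gate criteria are hypothetical, and `call_llm()` is a placeholder for your model client:

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your model client

def gate(output: str, criterion: str) -> bool:
    """A cheap yes/no check that the chain is still on track."""
    verdict = call_llm(
        f"Does the following output satisfy this criterion: {criterion}?\n"
        f"Answer YES or NO.\n\nOutput:\n{output}"
    )
    return verdict.strip().upper().startswith("YES")

def run_chain(record: str, question: str) -> str:
    # Subtask 1: pull out only the clinically relevant history.
    history = call_llm(f"Summarise the clinically relevant history:\n{record}")
    if not gate(history, "mentions concrete clinical findings"):
        raise ValueError("Gate failed after extraction")

    # Subtask 2: answer the question using only the extracted history.
    answer = call_llm(f"Using this history:\n{history}\n\nAnswer: {question}")
    if not gate(answer, "directly answers the question and cites the history"):
        raise ValueError("Gate failed after answering")
    return answer
```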
This resulted in significantly more LLM requests to the system over longer inputs and medical records, which came at a cost, but generally we were much happier with the outputs. More importantly, our customers at the time were very pleased with the results Anterior was achieving for them.
As we exposed our prompt-chaining workflow to more and more real-life variations of data, we began to observe areas of brittleness.
The majority of subtasks performed well. However, there were some instances where the task did not live up to our expectations. I have some funny memories of debugging issues where the LLM would outright refuse to answer questions without a bribe, or respond to our questions in the prompt with sarcastic questions of its own! Some of the underlying reasons included:
- A lack of context. Previously, a single "super" task had all the context and leaned heavily on the LLM decomposing the request and navigating its way to the output. After breaking this into smaller tasks, some tasks that required more context than they were provided struggled to generate useful outputs.
- The incoming data was largely unstructured and unfiltered. It would include instructions and directions to humans reading the document, and given the ratio of context to instructions in our own prompts, the model would sometimes take the instructions from the document at face value and go off the rails completely. For example, some medical records include forms prompting the Doctor to "tick a box". Our expectation was for the model to extract data on which box was ticked, not attempt to tick the box itself!
Approach #3: Workflow Routing
The idea of using LLMs for more than just prompt-answering really dawned on me when I saw this hackathon submission from my founder friends I met at Neo accelerator. They had found a clever way to make an LLM morph into something it's not: a REST API.
It got me thinking about other ways an LLM could be deployed to efficiently execute tasks it may not have originally been designed for. Routing was one of them.
Using LLMs for Workflow Routing is a fancy way of saying we use an LLM as a classifier whose output determines which follow-up task to call.
I found routing to work well for complex tasks where there are discrete paths that are better handled separately, and where the input to the system (or an output from the previous step in an LLM prompt chain) has enough detail to be handled accurately.
One example of Workflow Routing in action for Healthcare is using smaller and cheaper models to handle easy and common questions in medical records like "Is the patient male?", and using more capable models for harder questions that require deeper knowledge (more training parameters) or the ability to reason over more complex prompts.
Since it is unlikely that all customer requests will be composed entirely of "hard" questions, Workflow Routing is an effective way to optimise the cost and speed of the system. Keep in mind, however, that the routing step adds its own overheads.
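Here's a sketch of what that routing step can look like, with placeholder model names and a stubbed `call_llm()`; the real classifier prompt and model choices will depend on your stack:

```python
def call_llm(prompt: str, model: str) -> str:
    raise NotImplementedError  # placeholder for your provider's chat call

ROUTES = {
    "simple": "small-cheap-model",     # e.g. "Is the patient male?"
    "complex": "large-capable-model",  # e.g. multi-step clinical reasoning
}

def route(question: str) -> str:
    """Use a cheap model as the classifier; default to 'complex' when unsure."""
    label = call_llm(
        "Classify this question as 'simple' or 'complex'. Reply with one word.\n\n"
        f"Question: {question}",
        model=ROUTES["simple"],
    ).strip().lower()
    return label if label in ROUTES else "complex"

def answer(question: str, context: str) -> str:
    model = ROUTES[route(question)]
    return call_llm(f"Context:\n{context}\n\nQuestion: {question}", model=model)
```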
Approach #4: Parallelization
One of my engineers spent considerable effort assessing a concept he was keen to introduce to our system called "self-consistency". The high-level proposition was: what if we run the process several times in parallel and compare the outputs to ensure we're accurate?
I appreciated his line of thinking because he recognised the importance of accuracy in Healthcare. Our system could not afford to be wrong too often when we were operating with human doctors in the loop via co-pilot mode, and it could not afford to be wrong at all when we were operating in headless mode with no human in the loop.
The concept behind self-consistency lends itself to the idea of parallelization as a whole. In this approach, LLMs can work simultaneously on a task and aggregate their outputs to derive value from the workflow.
Self-consistency is one technique that falls into the "voting" bucket of parallelization, where several outputs are weighed against each other and a voting mechanism decides which answer, or which combination of answers, is the accurate output for the system.
Another technique within parallelization is called "sectioning". This is typically an adaptation of Prompt Chaining, where a large task can be broken down into independent subtasks that are run in parallel.
I have found sectioning to be most useful in areas of the pipeline that require guardrails to ensure safe and responsible outputs. For instance, one model can process the actual query whilst the other screens for inappropriate, unsafe or dangerous requests. This technique can be applied in the production run of a workflow, and also within Eval pipelines.
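As a sketch, sectioning can be as simple as firing the main query and the guardrail screen off in parallel and only returning the answer if the screen passes. Again, `call_llm()` and the prompts are stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your model client

def process_query(query: str) -> str:
    return call_llm(f"Answer the user's request:\n{query}")

def is_safe(query: str) -> bool:
    verdict = call_llm(
        "Is this request inappropriate, unsafe or dangerous? "
        f"Answer YES or NO.\n\n{query}"
    )
    return verdict.strip().upper().startswith("NO")

def guarded_answer(query: str) -> str:
    # Run the answer and the guardrail screen as independent, parallel sections.
    with ThreadPoolExecutor(max_workers=2) as pool:
        answer_future = pool.submit(process_query, query)
        safe_future = pool.submit(is_safe, query)
        if not safe_future.result():
            return "This request cannot be processed."
        return answer_future.result()
```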
Voting, in my experience, has been useful when dealing with ambiguous inputs open to interpretation. For example, the Medical Necessity Guidelines we process at Anterior are often very subjective and contain words like "recently" or "significant". It's often difficult for a human doctor or nurse to interpret that subjectivity, so it's understandable when a Large Language Model doesn't always respond with the same answer even when given identical inputs.
In such scenarios, running the same guideline through several models (or even the same model several times!) can yield a variation of responses. By asking the model to be very specific about why it might have chosen to approve a guideline that contains subjectivity, we can use an additional model (often referred to as LLM-as-judge) to aggregate the responses and decide on the path forward.
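A minimal sketch of that voting pattern follows, with a stubbed `call_llm()` and hypothetical prompts; the sample count and verdict labels are illustrative:

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your model client

def assess_guideline(guideline: str, record_summary: str, n: int = 5) -> str:
    prompt = (
        f"Guideline:\n{guideline}\n\nRecord summary:\n{record_summary}\n\n"
        "Is the guideline met? Answer MET or NOT MET, and state exactly which "
        "phrase in the guideline drove your decision."
    )
    # Same prompt, n independent samples (or n different models).
    candidates = [call_llm(prompt) for _ in range(n)]

    # LLM-as-judge aggregates the candidates into a single verdict.
    return call_llm(
        "You are aggregating independent assessments of the same guideline. "
        "Weigh their reasoning and return the single most defensible verdict "
        "(MET or NOT MET) with a one-line justification.\n\n"
        + "\n\n".join(f"Assessment {i + 1}:\n{c}" for i, c in enumerate(candidates))
    )
```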
Approach #5: Orchestrators
All of the approaches up until this point have used LLMs as a point-and-shoot solution to greasing the ambiguity within systems. None of these approaches are ground-breaking, except that they remove a significant amount of burden from the developer and assign it to the LLM instead.
However, it's important to note that these approaches limit the LLMs to the role of grease. Their full capability is not utilised until we begin to model our systems closer and closer to Agents. Refer back to the diagram at the beginning of this post, which shows a single LLM block that takes the user input directly, acts directly on the environment, and at the same time receives feedback directly from the environment it is operating within.
In an Orchestrator, the LLM is given a set of known tools it can use to answer a prompt and solve the challenge presented.
In an Orchestrator workflow, a central LLM is given the responsibility to dynamically break down tasks, delegate them to workers and synthesise the results. This architecture is not dissimilar to the video-encoding example often used to describe microservice architectures.
At Anterior, I found this workflow most suitable for complex tasks where the subtasks could not be predicted ahead of run-time. Though the Orchestrator approach topologically resembles Parallelization, the key difference is flexibility. It is atypical for early-stage startups to apply the Orchestrator approach because it requires a deep understanding of the type of input data their pipelines will process.
One example where I found this approach useful was conducting complex reasoning over a long in-patient medical record one page at a time. These records contain varying information spread across their pages, from details of past procedures to family medical history, hospital capabilities and the patient's evolving condition. Here, a central LLM was used to understand each page individually and dispatch pre-defined worker LLMs to carry out whatever next steps it dictated.
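A rough sketch of that page-by-page orchestration is below, where a central LLM decides which (hypothetical) workers to dispatch for each page and then synthesises the findings; `call_llm()` and the worker prompts are placeholders, not Anterior's actual pipeline:

```python
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your model client

# Hypothetical worker prompts; the real set depends on your domain.
WORKERS = {
    "past_procedures": "Extract all past procedures with dates:\n{page}",
    "family_history": "Summarise relevant family medical history:\n{page}",
    "current_condition": "Describe the patient's current condition:\n{page}",
}

def orchestrate(pages: list[str]) -> dict:
    findings = []
    for page in pages:
        # The central LLM decides, at run-time, which workers this page needs.
        plan = call_llm(
            f"Which of these workers should process this page: {list(WORKERS)}? "
            "Reply with a JSON list of worker names (possibly empty).\n\n"
            f"Page:\n{page}"
        )
        for worker in json.loads(plan):
            if worker in WORKERS:
                findings.append({
                    "worker": worker,
                    "output": call_llm(WORKERS[worker].format(page=page)),
                })
    summary = call_llm(
        "Synthesise these findings into a single clinical summary:\n"
        + json.dumps(findings)
    )
    return {"findings": findings, "summary": summary}
```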
Approach #6: Evaluator-Optimizer
The evaluator-optimizer approach is the next reasonable step beyond Orchestrators, taking a Workflow system further towards an Agentic system.
In this approach, one LLM call generates a response based on the input prompt whilst another provides evaluation and feedback in a loop. This approach yields the best results when implemented in a "closed system" within a much larger open system. The key characteristic of that closed system is that it has clear evaluation criteria against which the second LLM can operate the feedback loop.
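In code, the loop is short. Here's a sketch with a stubbed `call_llm()`, where the evaluation criteria are passed in explicitly (the "clear evaluation criteria" that make the closed system work); the PASS convention and round limit are illustrative:

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your model client

def generate_and_refine(task: str, criteria: str, max_rounds: int = 3) -> str:
    draft = call_llm(task)
    for _ in range(max_rounds):
        # Evaluator: checks the draft against explicit criteria.
        feedback = call_llm(
            f"Evaluate this output against the criteria.\nCriteria: {criteria}\n\n"
            f"Output:\n{draft}\n\n"
            "If it passes, reply exactly PASS. Otherwise list concrete fixes."
        )
        if feedback.strip() == "PASS":
            break
        # Optimizer: revises the draft using the evaluator's feedback.
        draft = call_llm(
            f"Revise the output to address this feedback.\n"
            f"Task: {task}\n\nDraft:\n{draft}\n\nFeedback:\n{feedback}"
        )
    return draft
```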
At Anterior, our first successfully launched product was a co-pilot interface where we used human-in-the-loop Nurses to evaluate the output of the system and articulate their feedback. The value add for our users was that evaluating the LLM output was still faster than manually interrogating the medical record PDF.
At the time of writing, the engineering team is exploring the feasibility of allowing an LLM to provide the feedback instead of a human. This can be achieved in a number of ways, beyond the scope of this post -- and perhaps an ideal follow-up -- such as fine-tuning an LLM with the thousands of human-review datapoints we have already collected.
And finally... Agents!
I started this exploration of building full-stack Agents by examining the possible approaches of building workflows to solve tasks with LLMs. As the needs and requirements of the system grew more complex, LLMs were used less for "grease" in traditional automation systems, and instead their ability to reason became the focal point of any approach.
With an Agentic approach, LLMs are deployed to handle sophisticated tasks, but the implementation is often straightforward. If we stay true to our definition that the system maintains control over how a task is accomplished, and that next steps may need to be calculated at run-time, we realize that an Agent is not much more than an orchestration of all the above approaches. They are typically just LLMs using tools (Approach #5) based on environmental feedback loops (Approach #6) that have self-control over data retrieval (Approach #1), the ability to break down complex tasks (Approach #2) and choose what to do next (Approach #3) and in what order (Approach #4).
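To close the loop, here's a minimal sketch of such an agent: a single LLM that, at each step, either calls a tool and observes the result or returns a final answer. The tools, the JSON protocol and `call_llm()` are all hypothetical placeholders, not Anterior's implementation:

```python
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your model client

# Hypothetical tools; each returns a string observation from the environment.
def search_record(query: str) -> str:
    raise NotImplementedError

def lookup_guideline(name: str) -> str:
    raise NotImplementedError

TOOLS = {"search_record": search_record, "lookup_guideline": lookup_guideline}

def run_agent(goal: str, max_steps: int = 10) -> str:
    history = []
    for _ in range(max_steps):
        # The LLM decides, at run-time, what to do next.
        decision = call_llm(
            f"Goal: {goal}\n"
            f"Steps so far: {json.dumps(history)}\n"
            f"Available tools: {list(TOOLS)}\n"
            'Reply with JSON: {"tool": <name>, "input": <string>} '
            'or {"final_answer": <string>}.'
        )
        step = json.loads(decision)
        if "final_answer" in step:
            return step["final_answer"]
        # Environmental feedback: the tool result is fed back into the loop.
        observation = TOOLS[step["tool"]](step["input"])
        history.append({**step, "observation": observation})
    return "Stopped: step budget exhausted"
```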
Thanks for reading!
My name is Zahid Mahmood, and I'm one of the founders of Anterior. I started this technology blog when I was in high school and grew it to over 100,000 readers before becoming occupied with other projects. I've recently started writing again and will be posting more frequently.