How not to use Prefect
At Anterior we were looking for a tool to orchestrate the pipelines that complete the collections of tasks behind our LLM workflows. These tasks were mostly LLM API requests to Anthropic, OpenAI et al., plus some peripheral work such as:
- storing large PDF files
- converting files to PDF
- reading PDF files (OCR)
- creating logical representations of clinical guidelines
- and more...
We ended up settling on Prefect as an ideal candidate for a workflow orchestrator as we prepared for the scale the US Healthcare industry was about to throw at us...
It turned out Prefect wasn't the right choice for us. We originally began to assess it as a solution in June 2024, and by November the team had grown sick of dealing with the onslaught of issues it brought. Some of these issues were down to Prefect not being a good fit for our use-case, and some of the challenges were self-inflicted -- we did a bad job of implementing Prefect in a way that could have served us well.
Our Product Engineers and AI Engineers had numerous discussions, and many Notion documents and analyses were written along the way. In this post, I have aggregated the most pertinent points that I felt were worth sharing, for the educational benefit of others building similar rails and infrastructure for their own projects.
This post is divided into two parts:
- Our learnings from using Prefect as a workflow orchestrator for almost six months.
- A conceptual model that has helped us move forward in a post-Prefect world.
What we learned from using Prefect as a workflow orchestrator
It's worth noting that our experience with Prefect is entirely with a self-hosted deployment. At Anterior, we are building an LLM-powered Nurse for Healthcare Administration in the US, which requires all of our technology and operations to be HIPAA compliant. Prefect did not offer HIPAA-compliant infrastructure, so we rolled our own.
Prefect was not the orchestrator we wanted it to be
Putting aside many evolving requirements for our product (which is typical and totally acceptable for a Series A startup), we discovered Prefect was not well-suited to Anterior's orchestration needs. Below I have outlined the top three shortcomings we encountered as a team:
Resource overhead
Prefect requires a sizeable amount of resources to operate. Before you even give it a task to complete, Prefect chews through a huge amount of memory just to stay idle. For spiky traffic like early-days Anterior's, it was expensive to host a well-resourced Docker container that sat idle most of the time and still fell over when asked to execute more than a handful of tasks.
Prefect orchestration is best suited to long-running tasks
The team was in love with the Prefect Dashboard, which showed a waterfall representation of tasks being executed by Prefect. This made observability, debugging and the general experience a delight for Product Managers and Engineers.
We could quickly log into the Dashboard and watch in real-time a flow or task complete. We could click in and see which customer had made the request, what the payload included, and which tasks were performing sub-optimally.
However, every time a task is started, Prefect sucks up a tonne of memory. This memory stays locked up for a period after the task completes too, whilst Prefect cleans up. We found that this overhead was poorly suited to tasks that run to completion in under thirty seconds, especially when there may be hundreds executing at the same time.
Prefect's database connection handling meant we removed the @task decorator across our workflows
The Prefect database connection appeared to be an issue when running a large number of tasks over a short period of time. Some of our workflows would spawn several hundred tasks, each completing a unit of work within a five-minute window. We found that Prefect's handling of pooled connections to the SQL database was inefficient.
As a result, one of our Engineers resorted to removing the @task decorator everywhere it wasn't critical, to stop the database connection overhead from causing Prefect to crash with an Out Of Memory exception.
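For anyone unfamiliar with Prefect, here is a minimal sketch (with hypothetical function names) of what that change looks like: the decorated version creates a task run whose state is written to Prefect's database on every call, while the undecorated version is plain Python that the server -- and therefore the Dashboard -- never sees.

```python
from prefect import flow, task


@task
def summarise_chunk(text: str) -> str:
    # Hypothetical unit of work; for us this was usually an LLM API call.
    return text[:100]


def summarise_chunk_untracked(text: str) -> str:
    # Same work without @task: no task run, no state written to the database.
    return text[:100]


@flow
def review_flow(chunks: list[str]) -> list[str]:
    # Tracked: each call below would become a task run recorded by the server.
    # return [summarise_chunk(chunk) for chunk in chunks]
    # Untracked: ordinary function calls, invisible to the Dashboard.
    return [summarise_chunk_untracked(chunk) for chunk in chunks]
```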
This was swiftly followed up by another Engineer introducing an SQS queue to retry tasks and workflows if Prefect crashed midway.
As a result, we were using less and less of Prefect's built-in functionality, which in turn made the Dashboard GUI less useful.
After removing the @task decorator, we took stock of where we were and concluded that we were working around Prefect's inability to service our use-case reliably. The time had come to consider alternative options.
SWE lessons learnt from our implementation of Prefect
It would be unfair to document Prefect's shortcomings without also shedding light on some of our own. It always takes two to tango: if a tool is not providing the value we expected, we must also examine how we used it.
We did not design our implementation of Prefect for scale, despite building a distributed system
At scale (dimension: volume), everything needs to be idempotent and replicable. I like to conceptually build large systems assuming load balancers may one day need to be placed on all boundaries and interfaces. That means our implementation of Prefect should have assumed multiple Prefect instances.
It didn't.
In fact, it heavily assumed that there would only ever be one instance of Prefect Server through which all Flows would be orchestrated.
This assumption may have rested on a further assumption that Prefect Server could handle hundreds of requests per second before exhausting itself. Unfortunately, we learned too late that our ceiling was only a few requests per second.
In any case, the way we introduced Prefect to our stack provided no simple way to double or 10x the number of Prefect Servers should our customers want to send us more volume.
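As a sketch of what designing for that could have looked like (the hostname below is hypothetical, and this assumes the server tier can be replicated behind a load balancer): Prefect workers and clients find the server through the PREFECT_API_URL setting, so pointing it at a load-balanced address rather than a single pinned container is what leaves room to add server replicas without touching application code.

```python
import os

# Hypothetical internal hostname; the address belongs to a load balancer in
# front of N Prefect Server replicas rather than a single pinned container.
os.environ.setdefault("PREFECT_API_URL", "https://prefect.internal.example.com/api")
```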
We bundled our Workers into one Super Worker
At the time of writing, I still don't believe I have sufficiently got this point across to the team, so writing out my thoughts is a good thing. It'll help me articulate why this is so important when it comes to discussing it in person.
I believe Workers should follow a principle of specialisation. Each Worker is to be good at doing one thing and one thing only (see: Single Responsibility Principle).
Yet, in our implementation of the system, every instance of our (single) Worker contained code, overhead and configuration for a variety of tasks.
Take two extremes:
- A worker for web scraping has a dependency on Selenium and Chromium, which have a huge footprint
- A worker to make an LLM API request can be as simple as a curl command
Our single, "super" Worker was configured to be able to do both. That meant it would download Selenium and Chromium on instantiation even if it only needed to make a brief LLM API request!
Based on our experience, some obvious guidelines and learnings (sketched in code after the list) are:
- Each Worker must have its own isolated and well-defined list of code and package dependencies.
- Each Worker must have its own isolated and well-defined list of hardware resource dependencies (typically memory, CPU and GPU).
- Separating Worker code means more resources (and/or replicas) can be directed to busier work pools. In Anterior's system there are far more LLM requests than there are web scraping tasks. It makes sense to provision five x-small Workers for the LLM requests and only one medium Worker for web scraping.
- Assigned worker resources must consider the role of the Worker: fewer resources for HTTP requests, more resources for web scraping. Use Python workers for scripts, and <another language> for other tasks where appropriate.
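Here is a rough sketch of those guidelines written down; the pool names, package lists and numbers are all hypothetical, but the point is that each pool declares only the code and hardware it actually needs.

```python
# Each worker pool declares its own isolated dependencies and resources, and
# replicas are assigned per pool rather than to one "super" Worker.
WORKER_POOLS = {
    "llm-requests": {
        "packages": ["httpx"],     # an LLM call is little more than an HTTP request
        "memory_mb": 512,          # x-small
        "replicas": 5,
    },
    "web-scraping": {
        "packages": ["selenium"],  # drags in Chromium -- a huge footprint
        "memory_mb": 4096,         # medium
        "replicas": 1,
    },
}


def spec_for(pool_name: str) -> dict:
    """Look up the isolated dependency and resource spec for a worker pool."""
    return WORKER_POOLS[pool_name]
```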
Each individual task is synchronous, and we want to orchestrate them asynchronously
If you can stop for a moment and imagine the joys of Python, I encourage you to explore the world of async Python.
(That was sarcasm)
Anterior's Python developers introduced async into tasks, which in retrospect was perhaps a premature optimisation that ignored the benefits to be leveraged at the infrastructure layer. An LLM request should be light, and in many cases be the only piece of work a Worker does.
Instead, we created instances where a worker was responsible for executing many LLM requests in parallel, when each one could have been spawned in a new Worker.
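To make that concrete, here is a sketch of the kind of pattern I'm describing (call_llm is a hypothetical stand-in for a provider API call): a single worker fanning out a large batch of requests with asyncio, so the whole batch lives or dies with that one process.

```python
import asyncio


async def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for an LLM provider API call.
    await asyncio.sleep(1)
    return f"response to: {prompt}"


async def run_batch(prompts: list[str]) -> list[str]:
    # One worker owns all of these in-flight requests; if the process dies,
    # every request in the batch has to be re-run.
    return await asyncio.gather(*(call_llm(p) for p in prompts))


results = asyncio.run(run_batch([f"prompt {i}" for i in range(100)]))
```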
The team rightfully pushed back on my point above, especially on my lack of consideration for resource efficiency. The point brought to my attention was that a Worker sitting idle whilst an API request is in flight is wasted resource, particularly with LLMs that often take thirty seconds to respond.
Point taken. And I strongly believe that an optimisation like squeezing out idle time needs to be 1) earnt, and 2) weighed against the benefits you sacrifice to get it. I'll make my point with some example scenarios:
- A Worker is instructed to make one hundred calls in parallel that have no dependencies, and one of them causes the Worker to crash. You now have to re-run all one hundred calls unless you have sophisticated logging and observability that allows you to retry exactly the requests that did not complete.
- LLM response times depend on a number of factors including the load a provider is experiencing, the number of tokens, the model used and the complexity of the task. If you begin optimising for idle time in your resources before you have strong Product-Market-Fit, you will spend an insane amount of time building tools and algorithms to calculate and track LLM providers, tokens, model usage and task complexity, instead of staying laser-focused on what needs to be built to capture customer value.
- Your hosting costs are growing 30% every month. As every team scales up during their free-credits period on AWS, Azure or GCP, they often lose sight of the actual cost of running their system. You will save so much more money tightening your infrastructure than prematurely optimising how a programming language makes an HTTP request.
Ultimately, the pattern to be followed here is that all Workers do synchronous work, and any asynchronous tasks should be spun off in a new Worker. Use your orchestrator to await task completions when there are dependencies.
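A sketch of that shape, using a process pool as a stand-in for "one Worker per task" (call_llm and summarise are hypothetical): each worker does one synchronous piece of work, and the orchestrating layer is the only place that waits on dependencies.

```python
from concurrent.futures import ProcessPoolExecutor


def call_llm(prompt: str) -> str:
    # Hypothetical synchronous unit of work -- the only thing this worker does.
    return f"response to: {prompt}"


def summarise(responses: list[str]) -> str:
    # Hypothetical dependent step that needs every response before it runs.
    return " | ".join(responses)


if __name__ == "__main__":
    prompts = [f"prompt {i}" for i in range(100)]
    # Stand-in for the orchestrator: each call runs in its own worker process,
    # failures can be retried individually, and the dependent step only runs
    # once everything it depends on has completed.
    with ProcessPoolExecutor() as pool:
        responses = list(pool.map(call_llm, prompts))
    print(summarise(responses))
```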
The video encoding pattern
Sometimes using shiny new tools pays huge dividends on the quality of your product and the productivity of your Engineering team. At Anterior, we took a bet on Prefect allowing us to move much faster in capturing customer value -- and ultimately that bet turned out to be the wrong one.
Not all is lost, though. Moments like this are valuable: the team learns together, and failures leave wounds that turn to wisdom and ensure mistakes are not repeated in the future. In the early days, I encourage my team to optimise for speed and learning.
The only thing we could have done better was pull the plug on Prefect earlier and fall back to a tried-and-tested pattern that has worked for decades: the video encoding pattern.
General system design pattern
The architecture for Anterior's system continues to evolve, yet some fundamentals remain true today.
1. The API layer is responsible for ingesting requests
2. The Services layer is responsible for managing raw data
3. The Workflow (Prefect) layer is responsible for processing data to generate an outcome for Customers
We can isolate the "Workflow (Prefect) layer" (3) into a system of its own, and apply the video encoding model to capture how we want the workflow layer to behave.
Overview
Video processing is a very CPU/GPU-intensive process that actually mirrors Anterior's approach to creating a Prior Auth determination. I'll first describe my understanding of video encoding on web services, and then translate that to how it might be applied to LLM workflows. The goal of both systems is to take a large amount of input data that cannot be predicted ahead of time, and create some useful output.
Video encoding 101
The system's job is to convert an incoming video from .avi to .mp4. The length of the incoming video is unknown until the time of the request: some users will upload a 30s video, whilst others upload hour-long conspiracy theory documentaries.
The video encoder used in this system takes approximately 45s to process 60s of video (roughly 1.3x real time), and it occupies a single CPU thread as well as the GPU. Multiple cores help push the encoding speed to 1.5x, 2x and beyond; however, this largely only works for shorter videos, since the next limitation the system runs into is memory usage: the entire video needs to be loaded into virtual memory whilst being encoded.
The standard practice here, as with any divide-and-conquer task, is to split the video into multiple parts, process them in parallel to make efficient use of time and idle resources, and then stitch the pieces back together.
An example use case
A user submits a 5GB video that is 2m3s in length. The system is configured to break the video into 30s segments based on a heuristic set by the developer. Run as a single process on the full 2m3s video, it would take the system approximately 1m32s to complete the entire encoding from start to finish on a single core.
This is the number an improved system utilising divide and conquer needs to beat, excluding other non-functional benefits.
This file can be added as a video processing task on a Queue, with several sub-tasks (a rough sketch in code follows the list):
- The first task is to split the video into 30s segments.
- The second set of tasks is to convert each 30s segment from .avi to .mp4 in parallel.
- The third task is to stitch the video parts back together, only once all tasks from the previous step have been completed.
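A rough sketch of that pipeline follows. The three helpers are hypothetical stand-ins (a real system would shell out to an encoder such as ffmpeg); the point is the shape: split, fan out across workers, then fan in only once every segment is done.

```python
from concurrent.futures import ProcessPoolExecutor


def split_video(path: str, segment_seconds: int = 30) -> list[str]:
    # Task 1 (placeholder): cut the source into ~30s segments and return their paths.
    return [f"{path}.part{i}.avi" for i in range(5)]


def encode_segment(segment_path: str) -> str:
    # Task 2 (placeholder): convert one segment from .avi to .mp4 -- runs in parallel.
    return segment_path.replace(".avi", ".mp4")


def stitch_segments(encoded_paths: list[str]) -> str:
    # Task 3 (placeholder): concatenate the encoded segments into one .mp4,
    # only after every task from the previous step has completed.
    return "output.mp4"


if __name__ == "__main__":
    segments = split_video("upload.avi")
    with ProcessPoolExecutor() as pool:                      # fan out
        encoded = list(pool.map(encode_segment, segments))
    print(stitch_segments(encoded))                          # fan in
```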
What this looks like:
- There are five segments to process: 30s, 30s, 30s, 30s and 3s.
- The 30s segments are done in ~22s and the 3s segment takes ~2s. Assuming there are five equal workers available, the job takes ~22s to complete.
- It then takes the system 3s to split the video before processing, and another 3s to stitch the parts back together.
- In total, the flow is completed in ~28s, saving roughly 1m4s (a quick back-of-the-envelope check follows).
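For the curious, those timings fall out of the ~45s-per-60s encode rate described above; a quick check:

```python
ENCODE_RATIO = 45 / 60           # ~0.75s of encoding per second of video
segments = [30, 30, 30, 30, 3]   # 2m3s of video split into 30s chunks

single_pass = sum(segments) * ENCODE_RATIO                  # ~92s (~1m32s) on one core
fan_out = 3 + max(s * ENCODE_RATIO for s in segments) + 3   # split + slowest segment + stitch ~= 28s
print(single_pass, fan_out, single_pass - fan_out)          # saving roughly a minute
```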
This approach also offers additional benefits such as:
- Using workers with a much smaller memory footprint than would be required to process the full 2m3s video in one go, as each worker only needs to hold 30s of video in memory.
- One Worker gets freed up after just 3s to complete another job.
- Segmenting the video means that one failed segment can be retried in isolation. Without segmentation, the entire video encoding process needs to be restarted on any failures and so checkpoints are a good thing.
How does this translate to an LLM workflow?
The video encoding pattern translates well for the medium-long term in systems that use LLM API requests for the majority of their "AI" work. Whether it still holds true as systems grow in complexity needs to be evaluated on a case-by-case basis, and it's impossible to build today for tomorrow's requirements we do not yet know.
- Consider an incoming request as a package of smaller tasks that can be broken down.
- If a task has no dependencies, put it on a Queue and let a Worker pick it up.
- If a task does have dependencies, package it up before placing it on a Queue so that the "packaged" task is not completed until all dependencies are resolved.
- Build your system such that inputs and outputs from tasks are stored as artefacts, which gives you checkpoints should individual tasks or subtasks fail (this and the hashing point below are sketched in code after the list).
- Use hashing for quick and easy comparison of task payloads when determining if a task needs to be re-run.
- Let your infrastructure handle orchestration. Your application should be optimised for doing work.
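As a sketch of those two points (the in-memory artefact store and helper names are hypothetical; in practice the artefacts would live in object storage or a database): hash the canonicalised payload, and skip any task whose output artefact already exists.

```python
import hashlib
import json
from typing import Callable

# Hypothetical artefact store: payload hash -> stored output. Inputs and
# outputs kept as artefacts act as checkpoints, so completed work is never repeated.
ARTEFACTS: dict[str, str] = {}


def payload_hash(payload: dict) -> str:
    # Canonicalise the payload so logically identical tasks hash identically.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def run_task(payload: dict, do_work: Callable[[dict], str]) -> str:
    key = payload_hash(payload)
    if key in ARTEFACTS:          # checkpoint hit: the task does not need to re-run
        return ARTEFACTS[key]
    result = do_work(payload)     # otherwise do the work and store the artefact
    ARTEFACTS[key] = result
    return result
```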
This post is not intended to dump on Prefect, nor does it tell the entire story of our experience using the tool. Instead, it's a curation of experiences, observations and lessons from my lens and I publish it with the hope that other developers and system practitioners can benefit from the insight.
Thanks for reading!
My name is Zahid Mahmood, and I'm one of the founders of Anterior. I started this technology blog when I was in high school and grew it to over 100,000 readers before becoming occupied with other projects. I've recently started writing again and will be posting more frequently.