Agents aren’t all you need

Lessons from Parcha's journey automating compliance workflows with AI, and why autonomous agents aren’t always the best solution


The Promise of AI Agents

We founded Parcha in early 2023, just as "AI agents" built on language models were taking off. We were inspired by demos showcasing how LLMs and code could dynamically create and execute complex plans using agentic behavior. Frameworks like AutoGPT and BabyAGI appeared, and people started applying this technology to numerous use cases. The demos looked promising, and the excitement around autonomous AI agents was palpable. Soon after, startups emerged for both horizontal use cases (AI agent platforms) and vertical ones (AI agents for specific industries), Parcha among them.

Starting Parcha: The Thesis of AI Agents

We founded Parcha with the thesis that AI agents could automate many operational processes in fintech and banks. Our initial approach was to take a standard operating procedure (SOP) that a human follows to perform an operational process and have an AI agent read and execute those steps autonomously. We developed an agent that would read the SOP and assume the role of the operations expert performing this process today, such as a compliance analyst.

Thanks for reading The Hitchhiker's Guide to AI by Parcha! Subscribe for free to receive new posts and support my work.

The Parcha agent would generate dynamic plans based on the SOPs provided by our design partners. These plans would guide the agent through each process step, using various tools and commands to execute tasks and record outcomes in a scratchpad. Once it had performed all the steps, the agent would use the SOP and the scratchpad to make a final determination, such as approving or denying a customer, and deliver the result to the end customer. We aimed to create a flexible and adaptive system capable of handling many workflows with minimal human intervention.

We were not building in a vacuum. From early on, we engaged with design partners: fintechs and banks with complex workflows and compliance processes that agreed to work with us and test our agent. These companies saw potential in our technology to complete their tasks without scaling up their workforce. We used their SOPs and even shadowed sessions where their teams performed these processes manually. We used our platform to build agents for Know Your Business (KYB), Know Your Customer (KYC), fraud detection, credit underwriting, merchant categorization, Suspicious Activity Report (SAR) filings, and other processes.

The feedback was very encouraging. The demos quickly impressed our partners, and we felt like we were on the path to building a general agent platform to cover multiple processes. However, as we transitioned from demos to production, we found ourselves (a tiny team, by design) spending too much time building the "agent" platform and not enough time building the product and solving the problems our design partners wanted us to address.

While the concept of agentic behavior was promising, building reliable agentic behavior with large language models (LLMs) was a massive endeavor. Creating general-purpose "autonomous agents" could have taken us years. During that time, we wouldn’t solve the problems our customers cared most about with a product that directly addressed their needs. Our customers required accuracy, reliability, seamless integrations, and a user-friendly product experience—areas where our early versions fell short. They would much rather have a solution that was very accurate and reliable for a subset of tasks than a fully autonomous solution that could automate a workflow end-to-end but worked only 80% of the time. We needed to choose between building the agent or building the product.

The Complexity of Putting Agentic Behavior in a Box

As we ran our agents in actual production use cases, we began to uncover the inherent complexity of agentic behavior within a system. From systems theory, we know that as the number of components and interactions within a system increases, so does the complexity and the likelihood of failure. Our approach, which relied on dynamically generated plans and a shared memory model (a scratchpad), made it difficult to decouple and measure subsystems effectively. Each run produced different plans, leading to an ever-shifting landscape of tasks and interactions, which increased the unpredictability and potential for errors.

To put this into perspective, suppose an AI agent autonomously carries out a workflow of 10 tasks with a 10% error rate per task. If the errors are independent, the chance that at least one task goes wrong is 1 - 0.9^10, or about 65%.
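The arithmetic behind that figure can be checked in a few lines. This is a minimal sketch that assumes task errors are independent of one another; the function name is ours, chosen for illustration.

```python
def workflow_failure_rate(per_task_error: float, num_tasks: int) -> float:
    """Probability that at least one of num_tasks independent tasks fails."""
    return 1.0 - (1.0 - per_task_error) ** num_tasks


# Show how quickly a fixed per-task error rate compounds as workflows grow.
for n in (1, 5, 10, 20):
    rate = workflow_failure_rate(0.10, n)
    print(f"{n:>2} tasks: {rate:.0%} chance of at least one error")
```

Under the same assumption, getting a 10-task workflow below a 10% overall failure rate would require each task to fail roughly 1% of the time or less.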

Imagine a factory where the assembly line is rearranged with every run. The placement and order of machinery and tasks change constantly, making it impossible to establish a consistent and reliable workflow. This is akin to what we faced with our agentic behavior model. Because the agent created its execution plan on the fly, no two executions were identical. That made the system exceedingly difficult to instrument and evaluate, which further complicated our ability to ensure reliability and performance.

The scratchpad, intended to facilitate the agent's observations and results, compounded these issues. Instead of simplifying the process, it created a tightly coupled system where decoupling tasks became nearly impossible. Each task depended on the preceding tasks' results recorded in the scratchpad, leading to a cascade of dependencies that we found challenging to isolate and test independently. Moreover, these dependencies made parallelization impossible. Tasks that could run in parallel had to wait, as the agent needed to evaluate the scratchpad before deciding the next steps.

This tightly coupled system with dynamically generated steps is inherently prone to failure for several reasons:

  1. Unpredictability: Each execution path varied significantly, making predicting the agent's behavior and outcomes difficult. This unpredictability is a hallmark of complex systems and a primary source of failure.
  2. Complex Interdependencies: Changes or errors in one part of the system could propagate through the scratchpad, affecting subsequent tasks and leading to systemic failures. The reliance on shared memory created a cascade of dependencies that amplified the impact of any single issue.
  3. Artificial Performance Bottlenecks: Tasks that could run in parallel ran serially because the agent needed to evaluate the scratchpad constantly. This bottleneck reduced efficiency and increased processing time, countering one of the critical advantages of using AI agents.
  4. Evaluation Challenges: The constantly evolving setup made instrumenting, testing, and evaluating the agent's accuracy difficult. Determining the root cause of errors was challenging. Was the issue with the dynamically generated plan, a misunderstanding of the SOP, a malfunctioning tool, or a flaw in the decision-making process? This complexity hindered our ability to ensure the agent delivered accurate and reliable results, ultimately impacting our ability to meet customer needs effectively.

These challenges underscored how impractical it was to rely solely on autonomous agentic behavior for complex workflows. Building a reliable, robust, and stable AI agent would require significant investment. Solving these challenges would have required all our resources for an extended period. Following that path would have prevented us from building our product, generating revenue, and iterating based on customer usage and feedback.

The agent “on rails”

As we spoke to more customers, we realized that for a specific use case like Know Your Customer compliance reviews, most companies used a variation of the same process and ran it over and over. That meant the agent didn’t need to develop a new plan every time it ran. We could move faster while leveraging what we built by putting the agent "on rails." We removed the dynamic generation of the plan and the shared scratchpad and instead developed an orchestration framework based on static agent configurations. For example, a KYB agent needs to perform the following tasks (simplified for illustration):

  1. Verify business registration
  2. Verify business ownership
  3. Verify the web presence of the business
  4. Verify if the business operates in a high-risk industry
  5. Verify if the business operates in a high-risk country

This plan started as a configuration file! Now that the plan was static, we could parallelize the process. Tasks 1, 2, and 3 can be done in parallel, while tasks 4 and 5 depend on task 3 as we scan the business's website to understand where and in which industry it operates.
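To make the idea concrete, here is a minimal sketch of a static plan with explicit dependencies, executed with Python's asyncio. The task names, the `KYB_PLAN` structure, and the `run_task` placeholder are assumptions made for illustration, not Parcha's actual configuration format or orchestration framework.

```python
import asyncio

# Hypothetical static plan: task name -> names of the tasks it depends on.
KYB_PLAN = {
    "verify_registration": [],
    "verify_ownership": [],
    "verify_web_presence": [],
    "check_high_risk_industry": ["verify_web_presence"],
    "check_high_risk_country": ["verify_web_presence"],
}


async def run_task(name: str, inputs: dict) -> dict:
    """Placeholder for a real check; sleeps briefly to simulate I/O-bound work."""
    await asyncio.sleep(0.01)
    return {"task": name, "used": sorted(inputs)}


async def run_plan(plan: dict) -> dict:
    """Start every task up front; each one waits only on its own dependencies."""
    tasks: dict = {}

    async def run_when_ready(name: str) -> dict:
        # Await the results of the dependencies, then run this task.
        dep_results = {dep: await tasks[dep] for dep in plan[name]}
        return await run_task(name, dep_results)

    for name in plan:
        tasks[name] = asyncio.create_task(run_when_ready(name))
    results = await asyncio.gather(*tasks.values())
    return dict(zip(tasks, results))


results = asyncio.run(run_plan(KYB_PLAN))
for name, result in results.items():
    print(name, "<-", result["used"])
```

Because the plan is data rather than something generated at run time, the three independent tasks start together and the two website-dependent checks begin as soon as the web presence task finishes. This sketch does not detect cycles; a plan with circular dependencies would hang.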

This approach allowed us to break down SOPs into discrete, manageable tasks, now automated with specialized tools. Previously, our "tools" were simple integrations, such as searching for an address on Google Maps. Now, Parcha tools handle complex yet decoupled parts of the process. For example, instead of having an AI agent figure out how to verify a business's registration, we use a sequence of tools that parse a PDF document, perform OCR on it, extract relevant information about the business, and check if it matches what the customer entered in the application. Each tool works standalone and can be reused in multiple workflows.

Focus

Since we no longer rely on an AI to determine a plan based on its SOP, we can only automate a process if we have a list of steps to perform it and the tools we need to do the job. Hence, we decided to hyperfocus on one area we knew well: Know Your Business/Customer (KYB/KYC). This allowed us to build intelligent workflows that could seamlessly integrate into our customers' existing processes, providing clear and measurable benefits quickly. We also placed a strong emphasis on the user experience (one of our values is to "Make it Dope!"), ensuring our product is easy to use and requires minimal integration and training. This new approach resonated well with our design partners, who, by this point, were paying customers.

Where Parcha is Today

The decision to focus on one workflow, executed as a simple set of steps, enabled us to build the best product for our customers. This strategic focus allowed us to explore and leverage generative AI more effectively. Today, we have LLM-powered workflows in over a dozen KYB and due diligence checks, covering document intelligence (e.g., proof of address verification), web-based due diligence (e.g., high-risk country/industry checks), and advanced search (e.g., adverse media and sanctions screening). These checks make extensive use of large language models and the concepts behind agentic behavior. For example, one Parcha KYB agent can perform more than a thousand LLM calls.

So, how do we use LLMs?

When we decided to focus, we also built our AI tools as decoupled, reusable blocks that we could extend and independently evaluate. We orchestrate these subsystems to perform tasks ranging from extracting specific pieces of information from documents to complex reasoning-driven functions like making an approve/deny decision based on the output of various checks. We built abstractions and primitives for these checks. For example, a "data loader" is responsible for extracting information from one or more sources and structuring it appropriately, while a "check" uses that information to pass judgment. This modular approach allows us to evaluate every part of this pipeline independently. Did we extract the correct information from the document? Was the decision, based on the extracted data, correct? This approach ensures we can focus on each task without worrying about too many external factors.
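A minimal sketch of the "data loader" and "check" split might look like the following. The function names, field names, and the simple text parsing that stands in for an LLM extraction call are all assumptions for illustration; the point is only that each half can be tested on its own.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    explanation: str


def load_registration(document_text: str) -> dict:
    """Hypothetical data loader: structures raw text into fields.

    A real loader would call a model; this stand-in parses 'Key: value' lines.
    """
    fields = {}
    for line in document_text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip().lower()] = value.strip()
    return fields


def name_matches_application(extracted: dict, application: dict) -> CheckResult:
    """Hypothetical check: judges only the structured fields it is given."""
    found = extracted.get("business name", "")
    expected = application.get("business_name", "")
    passed = bool(found) and found.casefold() == expected.casefold()
    verdict = "matches" if passed else "does not match"
    return CheckResult(
        passed,
        f"Document name {found!r} {verdict} application name {expected!r}.",
    )


document = "Business Name: Acme LLC\nState: DE"
extracted = load_registration(document)
print(extracted)
print(name_matches_application(extracted, {"business_name": "ACME LLC"}))
```

With this split, an extraction dataset can exercise `load_registration` alone, and a separate judgment dataset can feed hand-written fields straight into `name_matches_application`.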

For example, consider the task of screening news articles to determine if any of them mention a particular person of interest. Rather than building an agent that constructs a plan and autonomously fetches and analyzes each news article, we handle the process in a structured manner:

  1. Extraction: We first extract relevant information from each news article, including what happened, when it happened, who was involved, and other pertinent facts. Our extraction process handles formats like HTML, PDF, or plain text and extracts the information from one article at a time. Because a search can return many articles (e.g., a person with a common name will match many hits), we make tens or sometimes hundreds of concurrent LLM calls to an inexpensive model to extract this information. Each language model call focuses on this task, guided by a thoroughly tested prompt.
  2. Clustering: Then, we use the extracted features to cluster the articles into "news events." A news event (e.g., reporting a corporate financial crime) may contain multiple news articles, some with unique information to help the judgment process. We use a more complex model for this process since clustering requires more reasoning than a simple extraction.
  3. Partial Judgment: Once we have news events and all the information surrounding them, we use a top-of-the-line model to determine if the person of interest is the perpetrator mentioned in the article based on their name, age, address, and other available information about their profile. The language model independently provides judgments on each article, ensuring that the reasoning process is constrained to limited and structured information without noise.
  4. Final Decision: Finally, we aggregate the partial judgments' results. We then synthesize these judgments into a final pass-or-fail decision (e.g., "we found adverse media about this person"), a detailed report about what we found, and a human-readable explanation of the thought process that customers can incorporate into their audit logs.
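The four stages above can be sketched as a small pipeline. Everything here is illustrative: the stage functions are rule-based stubs standing in for model calls, the pipe-delimited "articles" are a toy format, and the names are hypothetical. What the sketch shows is the shape of the flow: concurrent per-article extraction, grouping into events, an independent judgment per event, and a final aggregation.

```python
import asyncio
from collections import defaultdict


async def extract(article: str) -> dict:
    """Stage 1 stub; a real system would call an inexpensive model here."""
    await asyncio.sleep(0)  # stands in for network latency
    event, person, age = article.split("|")
    return {"event": event, "person": person, "age": int(age)}


def cluster(facts: list) -> dict:
    """Stage 2 stub; groups per-article facts into news events."""
    events = defaultdict(list)
    for fact in facts:
        events[fact["event"]].append(fact)
    return dict(events)


def judge(event_facts: list, profile: dict) -> bool:
    """Stage 3 stub; True if the event appears to involve the profiled person."""
    return any(
        f["person"] == profile["name"] and f["age"] == profile["age"]
        for f in event_facts
    )


def decide(judgments: dict) -> dict:
    """Stage 4; aggregates partial judgments into a pass/fail report."""
    matched = sorted(event for event, hit in judgments.items() if hit)
    explanation = (
        f"Adverse media found in events: {', '.join(matched)}."
        if matched
        else "No event matched the person's profile."
    )
    return {"passed": not matched, "matched_events": matched, "explanation": explanation}


async def screen(articles: list, profile: dict) -> dict:
    # Extraction calls are independent, so they run concurrently.
    facts = await asyncio.gather(*(extract(a) for a in articles))
    events = cluster(list(facts))
    judgments = {name: judge(items, profile) for name, items in events.items()}
    return decide(judgments)


ARTICLES = [
    "fraud-2021|John Smith|45",
    "fraud-2021|John Smith|45",
    "marathon-2019|John Smith|23",
]
PROFILE = {"name": "John Smith", "age": 45}

report = asyncio.run(screen(ARTICLES, PROFILE))
print(report)
```

Each stage takes and returns plain data, so any one of them can be swapped for a different model or evaluated against its own dataset without touching the others.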

Benefits of this Approach

This approach, building on the agent "tool" (or what we call "check") as a primitive noun in our system, has enabled us to move faster while reliably delivering value to our customers.

Here are some of the benefits we have observed so far:

1. Modularity and Reusability: We can extend and independently evaluate each component by building AI tools as decoupled, reusable blocks. This modular approach allows us to maintain flexibility and adaptability while ensuring high reliability. Each subsystem, whether extracting information from documents or performing complex reasoning tasks, functions independently and can be upgraded or replaced without affecting the entire system.

2. Scalability and Parallelization: Our workflows can handle large volumes of records efficiently by parallelizing tasks. For example, when screening news articles, we make tens or hundreds of concurrent LLM calls to inexpensive models to extract information. As a result, complex workflows that may take a human 30 minutes to an hour finish in minutes.

3. Accuracy and Reliability: We independently evaluate each part of the workflow to ensure accuracy. For instance, the "data loader" fetches and extracts information, and the "check" uses that data to pass judgment. We have datasets and experiments for each part of the pipeline, so one team member can improve, test, and measure extraction while someone else does the same for the judgment component.

4. Cost-Effectiveness: Our approach allows us to match the model to the complexity of the task. We use inexpensive models for more straightforward extraction tasks and reserve more capable models for complex reasoning and judgment tasks. This strategic use of resources helps us control costs while maintaining high-quality results.

5. Transparency and Auditability: We provide a clear and human-readable explanation of the thought process behind each decision. This transparency is crucial for audit logs and compliance reporting, ensuring our customers can trust and verify the results. The final aggregation and decision-making process includes detailed reports outlining the reasoning and data used, making the process transparent and auditable.

Where we think AI agents are useful

While compliance workflow automation may not be the ideal use case for AI agents today, they have potential in other areas. One successful application is in customer support, where AI agents answer questions based on a knowledge base, reducing the likelihood of failure. Companies like Intercom and Klarna have seen success with this approach. Additionally, AI agents can be effective when strong feedback loops are present, such as coding agents that must verify their code runs correctly before proceeding.

Conclusion

Our journey at Parcha has taught us several critical lessons about building production-ready products using AI. It has also taught us the value of moving quickly and not shying away from change. In just a few months, we completely changed our approach to focus on reliability and accuracy instead of autonomy, resulting in workflows that are now three times more reliable.

We recognized that building reliable agentic behavior with large language models (LLMs) would have been a massive endeavor, requiring most of our resources and focus. Yet we knew we could use LLMs and many of the concepts behind agents today and provide significant value to our customers.

Today, our focus on modular and scalable LLM-powered workflows has enabled us to deliver reliable, efficient solutions with unit economics that make sense to Parcha and our customers. We have accelerated our customers' largely manual processes and enabled them to grow faster, and since we can move fast, we can quickly improve our product based on their feedback. Ultimately, they did not ask us to build them an AI agent; they wanted us to solve a problem effectively and efficiently.
