When we started working with large language models (LLMs), we checked out LangChain to see if it could help manage the complexity. LangChain is a solid framework used industry-wide because it makes switching between different models like OpenAI and Cohere easy, has handy tool integrations like Google Search, and there’s a wealth of tutorials and community resources. These features make it a go-to choice for many who need to streamline and automate LLM workflows.
However, as we began designing our intelligent test maintainer, we realized that LangChain, though powerful, wasn’t the right fit for our specific needs. Our system demanded customization and control that LangChain couldn’t offer without significant compromises. Adapting it to meet our requirements would have introduced unnecessary complexity and slowed our development process, making it feel like trying to fit a square peg into a round hole.
After carefully considering five critical questions, we decided to try something else: a custom LLM orchestration framework tailored specifically to handle the unique demands of our projects. This framework allows us to manage complex chains, customize agent interactions, and optimize performance precisely as needed. While the decision to build our own framework over using an existing solution like LangChain was risky, it was ultimately the best choice to keep our work fast, efficient, and ideally suited to our needs.
Many teams are in the same position we were, and the five questions we asked ourselves can help them determine whether LangChain is right for them.
Before diving into a new framework, you need to make sure it aligns with your project’s specific needs. LangChain is a powerful tool, but it’s not a one-size-fits-all solution. Here are five yes/no questions to ask yourself to determine if LangChain is the right fit for your project.
The more “yes” answers you give, the more likely it is that LangChain will work for you.
If your application is rapidly expanding its number of prompt templates, consider how well LangChain's string-based templating (including its Jinja-style syntax) will cope with that growth. As your template library grows, keeping your codebase maintainable by a team of developers becomes even more critical. The more templates you have, the more string-template variables you have to manage and the higher the risk of type mismatches, leading to runtime errors that are difficult to trace and fix.
But type safety isn’t the only concern. Organizing, updating, and ensuring consistency across a large and expanding set of templates can quickly become overwhelming. Without proper tools and processes, you might encounter performance, debugging, and version control issues.
LangChain’s flexibility makes it easy to create and integrate new templates. Still, as the number of templates grows, keeping them organized and consistent can become challenging, leading to potential errors and increased maintenance effort. You’ll need to ensure the framework can scale with your needs, keeping your application stable and efficient as it grows. If LangChain doesn’t meet these requirements, you might find yourself facing increased overhead in managing and maintaining your template library, which could slow down your development process.
LangChain offers built-in telemetry features that provide basic tracking with minimal setup. If you prefer a low-overhead solution that gets you up and running quickly, LangChain’s standard telemetry might be a good fit. It handles essential monitoring without requiring extensive customization.
However, if you need more control over your telemetry—such as custom visualizations, alerting and anomaly detection, and a usable UX when dealing with large inputs like HTML files—you might find LangChain’s built-in tools limiting. Many teams turn to additional solutions, such as Helicone, to fill these gaps and provide more comprehensive monitoring and analytics capabilities. For projects that require deeper analysis or highly customized tracking, even LangChain’s paid options may not provide the level of control you need.
If your project requires complex control flows, such as hierarchical state representation or intelligent history summarization, weigh that carefully when evaluating LangChain.
LangChain is excellent for many workflows, but its built-in tools can be limiting when it comes to deep customization. If your project needs agents to interact in complex ways or make dynamic decisions, LangChain’s default setup might not offer the flexibility you need, and you’ll have to evaluate whether it can handle your requirements without adding extra complexity.
LangChain’s tools work fine for workflows that don’t require significant customization.
If you’re experimenting with novel LLM agent architectures, choosing a framework that allows flexibility and easy adaptation is essential. LangChain’s powerful abstractions can simplify complex workflows, but they may also introduce challenges when working with cutting-edge LLM designs that require frequent changes and experimentation.
When your project is dynamic and involves exploring new LLM agent architectures, you need a codebase that can be easily modified to accommodate new ideas and approaches. While LangChain provides a strong foundation, its structured design can sometimes hinder the rapid iteration and deep changes often needed in experimental projects. Adapting LangChain’s components to fit novel architectures might require extra workarounds, potentially slowing down your innovation.
In a multi-agent system, several AI agents work together, often with complex interactions and handoffs. If your project needs this kind of setup, it’s important to understand how well LangChain can support it.
LangChain and its counterpart, LangGraph, can handle a variety of workflows, but multi-agent systems add layers of complexity that require deep customization. For example, if your agents need to communicate, share data, or manage tasks in a coordinated way, you’ll need to ensure that LangChain’s or LangGraph’s tools can handle these interactions smoothly. Without the proper support, managing these agents could become cumbersome, leading to potential bottlenecks or inefficiencies in your system.
One significant challenge we anticipated with LangChain was its reliance on a Jinja-inspired templating syntax for structuring prompts. While Python 3 offers optional type safety, this doesn’t extend to Jinja templates, where type mismatches can easily slip through. In our system, where potentially hundreds of prompts exchange data, even a tiny mistake can lead to cascading errors. As our engineers frequently refactor and tweak these prompts, we need to eliminate possible sources of mismatched types wherever possible.
Consider the following example, which highlights the potential fragility when using LangChain:
In this example, there's no type checking to ensure that variables like language, text, and aiOutput are used correctly within the templating syntax. More importantly, your ability to refactor variable names throughout the codebase or ensure all required prompt variables are populated during compile time is severely stunted. In the code below, let's see how these two prompts will interact with each other.
If language were passed as a number instead of a string, or if aiOutput were undefined, these issues wouldn’t be caught until runtime, which could lead to difficult-to-trace errors.
This gap increases the risk of introducing bugs, especially when our team is constantly refactoring code to improve performance. In 2024, we expect our development environments to provide strong type safety throughout the entire workflow, catching these issues before they cause problems.
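To make the contrast concrete, here is a minimal sketch of what compile-time checking of prompt variables can look like in TypeScript. It is an illustration only: the `Prompt` type, the two prompt functions, and the prompt wording are hypothetical and are not taken from our framework or from LangChain. Only the variable names language, text, and aiOutput come from the discussion above.

```typescript
// Hypothetical helper type: a prompt is a typed function from variables to text.
type Prompt<Vars> = (vars: Vars) => string;

// Variables required by the first prompt.
interface TranslateVars {
  language: string;
  text: string;
}

// Variables required by the second prompt, which consumes the first prompt's output.
interface CritiqueVars {
  language: string;
  aiOutput: string;
}

const translatePrompt: Prompt<TranslateVars> = ({ language, text }) =>
  `Translate the following text into ${language}:\n${text}`;

const critiquePrompt: Prompt<CritiqueVars> = ({ language, aiOutput }) =>
  `Critique this ${language} translation:\n${aiOutput}`;

const firstPromptText = translatePrompt({ language: "French", text: "Hello" });

// Stand-in for a model response; a real system would call an LLM here.
const aiOutput = "Bonjour";
const secondPromptText = critiquePrompt({ language: "French", aiOutput });

// Both calls below are rejected by the compiler instead of failing at runtime:
// translatePrompt({ language: 42, text: "Hello" }); // number is not a string
// critiquePrompt({ language: "French" });           // aiOutput is missing
```

Because each prompt's variables are ordinary typed properties, renaming one with an editor's refactoring tools updates every call site, and a missing variable becomes a compile error.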
It’s true that LangChain supports live tracking and real-time telemetry data, but we needed it to do more. For example, by visualizing a histogram of session lengths, we can study the distribution to better understand the behavior of our AI agent in production. This real-time insight lets us fine-tune our operations, enhancing the customer experience.
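As a rough sketch of the histogram idea, the function below buckets session lengths into fixed-width bins. The function name, the choice of minutes as the unit, and the sample data are all assumptions made for illustration. A real pipeline would feed the result to a charting tool.

```typescript
// Bucket session lengths into fixed-width bins, keyed by each bin's lower bound.
// Assumption: lengths are measured in minutes.
function sessionLengthHistogram(
  lengthsMinutes: number[],
  binWidthMinutes: number,
): Map<number, number> {
  if (binWidthMinutes <= 0) {
    throw new Error("binWidthMinutes must be positive");
  }
  const bins = new Map<number, number>();
  for (const length of lengthsMinutes) {
    // A 7-minute session with 5-minute bins lands in the bin starting at 5.
    const binStart = Math.floor(length / binWidthMinutes) * binWidthMinutes;
    bins.set(binStart, (bins.get(binStart) ?? 0) + 1);
  }
  return bins;
}

// Made-up sample data: six sessions, grouped into 5-minute bins.
const histogram = sessionLengthHistogram([1, 3, 4, 7, 12, 14], 5);
// Bin 0 holds 3 sessions, bin 5 holds 1, and bin 10 holds 2.
```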
To achieve this level of detailed tracking, we need to integrate custom telemetry at multiple points in our processes. LangChain doesn't provide the level of customization we require. Its abstractions make it challenging to embed telemetry deeply into every step of our workflow. For instance, monitoring and logging how an LLM’s output evolves in response to different inputs would require significant modifications or workarounds within LangChain’s existing structures.
We also needed custom alerting (e.g., sending a Slack message whenever a session exceeds budget). Furthermore, we needed business-logic-specific anomaly detection (e.g., monitoring the number of times a specific prompt returns true or false and labeling anomalous time periods). LangChain didn’t provide us with an easy way to get these required features.
This limitation was a critical factor in our decision to build our own framework. Our custom telemetry implementation captures detailed data at every stage, giving us the real-time feedback necessary to continuously optimize our processes and ensure that each session runs as efficiently as possible.
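The budget alert described above can be sketched in a few lines. This is not our production telemetry: the `SessionTelemetry` class, the `StepEvent` shape, and the dollar-based budget are hypothetical, and the alert sink is a plain callback where a real system might post to Slack.

```typescript
// One recorded LLM call within a session (hypothetical shape).
interface StepEvent {
  sessionId: string;
  promptName: string;
  costUsd: number;
}

// An alert sink; in production this might post a message to Slack.
type AlertFn = (message: string) => void;

class SessionTelemetry {
  private readonly spendBySession = new Map<string, number>();
  private readonly alertedSessions = new Set<string>();
  private readonly budgetUsd: number;
  private readonly alert: AlertFn;

  constructor(budgetUsd: number, alert: AlertFn) {
    this.budgetUsd = budgetUsd;
    this.alert = alert;
  }

  // Record a step and alert once, the first time a session exceeds its budget.
  record(event: StepEvent): void {
    const total = this.totalFor(event.sessionId) + event.costUsd;
    this.spendBySession.set(event.sessionId, total);
    if (total > this.budgetUsd && !this.alertedSessions.has(event.sessionId)) {
      this.alertedSessions.add(event.sessionId);
      this.alert(`Session ${event.sessionId} exceeded budget at step ${event.promptName}`);
    }
  }

  totalFor(sessionId: string): number {
    return this.spendBySession.get(sessionId) ?? 0;
  }
}

const alerts: string[] = [];
const telemetry = new SessionTelemetry(1.0, (message) => {
  alerts.push(message);
});
telemetry.record({ sessionId: "s1", promptName: "generateCode", costUsd: 0.6 });
telemetry.record({ sessionId: "s1", promptName: "critique", costUsd: 0.6 });
telemetry.record({ sessionId: "s1", promptName: "fix", costUsd: 0.2 });
// Only the "critique" step triggers an alert; later steps stay silent.
```

Owning the recording call means the same hook can feed counters for anomaly detection, such as how often a given prompt returns true or false.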
Our most complex workflows may involve over 40 LLM requests to produce a single output. Each request must be carefully coordinated, as inputs like HTML and specific tasks pass through a series of operations—from code generation to critique—until we achieve the final result.
When we evaluated LangChain for orchestrating these complex sequences, we quickly realized it wasn't the right fit. While LangChain is excellent for more straightforward workflows, its structure would have made managing our extensive chains difficult and inefficient. We needed a solution that could keep each prompt as simple and performant as possible without being bogged down by unnecessary complexity.
That’s when we turned to the concept of the state monad. In functional programming, a state monad is a design pattern that allows you to pass state through a sequence of functions or operations in a controlled manner without accumulating unnecessary data along the way. This concept was perfect for our needs because it let us efficiently manage the information flow in our LLM chains in a type-safe way.
By implementing the state monad approach, we make sure that each prompt in our chain operates on only the relevant data, keeping the process streamlined and effective. This method allows us to maintain the performance of our LLM chains, ensuring that even the most complex workflows run smoothly.
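For readers unfamiliar with the pattern, here is a minimal state monad sketch in TypeScript. The `State` type with `pure` and `flatMap` is the textbook formulation. The `ChainState` shape and the two stand-in steps are hypothetical and exist only to show state being threaded through a chain. They are not our framework's API.

```typescript
// A State computation takes a state and returns a value plus the next state.
type State<S, A> = (state: S) => [A, S];

// Wrap a plain value without touching the state.
function pure<S, A>(value: A): State<S, A> {
  return (state) => [value, state];
}

// Sequence two computations, threading the state from the first into the second.
function flatMap<S, A, B>(
  first: State<S, A>,
  next: (value: A) => State<S, B>,
): State<S, B> {
  return (state) => {
    const [value, nextState] = first(state);
    return next(value)(nextState);
  };
}

// Hypothetical chain state: only what downstream prompts need.
interface ChainState {
  html: string;
  code?: string;
  critique?: string;
}

// Stand-ins for LLM calls; each one reads and writes only its own slice of state.
const generateCode: State<ChainState, string> = (state) => {
  const code = `// test for a page with ${state.html.length} chars of HTML`;
  return [code, { ...state, code }];
};

const critiqueCode = (code: string): State<ChainState, string> => (state) => {
  const critique = code.length > 0 ? "looks plausible" : "empty output";
  return [critique, { ...state, critique }];
};

// Compose generation, critique, and a final formatting step into one chain.
const pipeline = flatMap(flatMap(generateCode, critiqueCode), (critique) =>
  pure<ChainState, string>(`verdict: ${critique}`),
);

const [verdict, finalState] = pipeline({ html: "<button>Buy</button>" });
```

Each step declares the state type it works with, so the compiler checks every handoff in the chain, and nothing is carried forward unless a step explicitly writes it into the state.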
We also learned that tweaking the representation of an agent's history significantly improves the prompt’s performance, especially for long-term sessions. There's no way we would use LangChain's default chat history representation. Customizing this to our level of satisfaction is best done without needing to abide by the peculiarities of any framework.
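One simple way to customize history representation is to keep the latest entries verbatim and collapse everything older into a summary. The sketch below assumes a hypothetical `HistoryEntry` shape and takes the summarizer as a parameter. In practice that summarizer could be another LLM call.

```typescript
interface HistoryEntry {
  role: "agent" | "environment";
  content: string;
}

// Hypothetical summarizer signature; a real one might call an LLM.
type Summarizer = (entries: HistoryEntry[]) => string;

// Keep the most recent entries verbatim and collapse older ones into one summary entry.
function compactHistory(
  history: HistoryEntry[],
  keepRecent: number,
  summarize: Summarizer,
): HistoryEntry[] {
  if (history.length <= keepRecent) {
    return history;
  }
  const cutoff = history.length - keepRecent;
  const older = history.slice(0, cutoff);
  const recent = history.slice(cutoff);
  const summary: HistoryEntry = { role: "environment", content: summarize(older) };
  return [summary, ...recent];
}

const compacted = compactHistory(
  [
    { role: "agent", content: "click login" },
    { role: "environment", content: "login page loaded" },
    { role: "agent", content: "type password" },
    { role: "environment", content: "dashboard loaded" },
  ],
  2,
  (entries) => `Summary of ${entries.length} earlier steps`,
);
// compacted holds one summary entry followed by the two most recent entries.
```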
Staying at the forefront of AI research, we constantly explore the latest ideas from papers like OpenDevin, Tree-of-Thought, and Voyager to keep our work innovative and competitive. For us, being able to implement these new ideas quickly is not optional; it's essential. We are up against fierce competition and need to beat them to the punch.
When we looked at LangChain, we saw that its structure would make jumping straight into coding these fresh concepts difficult. The need to navigate LangChain’s abstractions would have slowed down our ability to keep up with the fast pace of AI advancements. We immediately knew we needed a solution that allowed for direct implementation, enabling us to experiment and innovate without barriers.
Of course, it’s possible to implement ideas like Tree-of-Thought with LangChain. In fact, LangChain offers robust parallelization and other experimental features. But just because you can do something doesn’t always mean you should.
When we set out to build our multi-agent system, we needed a framework that could transition smoothly from high-level overviews to detailed implementations while also allowing us to tailor every aspect to our needs. At first, we considered LangChain but quickly ruled it out. Its structured design seemed too rigid to manage the complex control flows we anticipated, especially in scenarios where agents need to interact in intricate ways, such as cyclic networks where the same agent might be called multiple times. We then moved to LangGraph (along with a few others), but it became clear that nothing on the market fully supported the approach we envisioned.
At QA Wolf, we use a layered abstraction approach—what we call L1, L2, and L3—to design and manage our AI systems. This method breaks down our complex workflows into three distinct levels:
L1 (Level 1) - High-level overview
This level gives a simplified, big-picture view of the system, highlighting how the main components interact without diving into the details. It helps us understand the overall structure and flow at a glance. You can think of it as a representation of a graph, like so:
This allows us to think about relationships between nodes.
L2 (Level 2) - Intermediate details
Here, we go deeper, focusing on the specifics of each component, such as input types, interfaces, and key interactions, but we still avoid the low-level implementation details. L2 defines the "what" of the system: what each part does and how they interact.
L3 (Level 3) - Implementation specific
This is where we get into the detailed code, algorithms, and data handling. L3 covers the "how": how each component operates at the technical level. In the case of our example, L3 would be the deeper details about the weather API, such as which endpoint is being used, how errors are handled, and API-specific details such as rate limit quotas and other gotchas.
Using this layered abstraction, we can manage complexity more effectively, allowing each level to focus only on the appropriate amount of information needed. This approach makes our systems easier to design, develop, and maintain.
Our layered abstraction approach represents a shift in our thinking about AI system design. Rather than diving straight into the code or treating every detail with the same importance, we prioritize clarity and structure, separating concerns into manageable layers. This helps avoid information overload and, importantly, keeps the development process clear and efficient.
LangGraph, while a powerful tool, doesn’t align well with our layered abstraction model. It tends to blur the lines between these levels, mixing high-level concepts with low-level implementation details.
What we want the code to look like:

```typescript
// L1: which nodes each node can hand off to.
const graph = {
  agent: [weatherTool, finish],
};

// L1: the nodes themselves, grouped by role.
const nodes = {
  agents: [agent],
  tools: [weatherTool],
  controls: [start, finish],
};
```

vs. what it looks like implemented in LangGraph:

```typescript
const workflow = new StateGraph(GraphState)
  .addNode("agent", callModel)
  .addNode("tools", toolNode)
  .addEdge("__start__", "agent")
  .addConditionalEdges("agent", shouldContinue)
  .addEdge("tools", "agent");
```

Note that in the example above, the conditional edges in the LangGraph code are breaking the L1 abstraction. Ideally, such information belongs in L2. To us, the LangGraph implementation lacks cohesion.
Overall, as we moved through the different layers of our system, from high-level diagrams (L1) to more detailed views (L2 and L3), we found that LangGraph’s API didn’t align with our thinking. Adopting it would have forced us to compromise on customization, complicating our development process and eroding the clarity and control we strive for.
For instance, LangGraph’s predefined node types (agents and tools) didn’t allow for the nuanced control we needed. We envisioned additional nodes like “modifiers” to adjust data focus, “actions” to pause and interact with the environment, and “control flows” to manage agent handoffs. Implementing these in LangGraph would have involved significant workarounds, compromising the efficiency and customization we aimed for.
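To illustrate what first-class node kinds could look like, here is a small sketch that models them as a TypeScript discriminated union. The type, the field names, and the `describeNode` helper are hypothetical. Only the five kinds of node come from the paragraph above.

```typescript
// Hypothetical node kinds modeled as a discriminated union on `kind`.
type WorkflowNode =
  | { kind: "agent"; name: string }
  | { kind: "tool"; name: string }
  | { kind: "modifier"; name: string; focus: string }
  | { kind: "action"; name: string; pausesAgent: boolean }
  | { kind: "controlFlow"; name: string; handoffTo: string };

// Describe a node; the compiler verifies that every kind is handled.
function describeNode(node: WorkflowNode): string {
  switch (node.kind) {
    case "agent":
      return `agent ${node.name}`;
    case "tool":
      return `tool ${node.name}`;
    case "modifier":
      return `modifier ${node.name} narrows data to ${node.focus}`;
    case "action":
      return `action ${node.name}${node.pausesAgent ? " (pauses the agent)" : ""}`;
    case "controlFlow":
      return `control flow ${node.name} hands off to ${node.handoffTo}`;
    default: {
      // Adding a new kind without a case makes this assignment a compile error.
      const unreachable: never = node;
      throw new Error(`Unhandled node: ${JSON.stringify(unreachable)}`);
    }
  }
}

const handoffDescription = describeNode({
  kind: "controlFlow",
  name: "escalate",
  handoffTo: "reviewer",
});
```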
Ultimately, our focus on maintaining precise levels of abstraction and prioritizing customization led us to build our own framework. This approach allowed us to keep a consistent structure across all levels, from the big picture to the fine details, ensuring we had the flexibility we needed.
Our choice to develop a custom LLM orchestration framework was driven by a clear need for flexibility, precision, and the ability to manage complex workflows that existing frameworks like LangChain couldn’t handle. While LangChain provides excellent tools and resources, it wasn’t designed for our unique challenges. By building our own solution, we’ve ensured that our system stays fast, efficient, and aligned with our specific needs.
At QA Wolf, we’re not just committed to keeping up with AI advancements—we’re determined to lead the way. Our custom framework reflects our philosophy of questioning the status quo and building for a future that’s as adaptable as it is innovative.
We’re excited about the road ahead as we continue to push the boundaries of what’s possible in AI. Our focus remains on delivering the most effective, cutting-edge solutions, and we’re eager to see where our dedication to innovation will take us next.