What LLM frameworks are (and why they matter)
An LLM framework is a library that sits between your application code and one or more large language models. At minimum it wraps provider APIs; at maximum it takes responsibility for prompt construction, tool calling, memory, retrieval, typed output, multi-step agent loops, evaluation and observability. Most of the work of building a production LLM feature is plumbing, and frameworks exist to save you from writing that plumbing from scratch.
In 2026 the category has matured in two directions at once. The big names (LangChain, LlamaIndex) have split into modular sub-projects with dedicated cloud platforms. At the same time, a new generation of smaller, more opinionated libraries — Mastra, PydanticAI, smolagents, BAML — have arrived for teams who found the older stacks too heavy. Model vendors have also started shipping their own agent SDKs (OpenAI Agents SDK, Claude Agent SDK), which handle the loop for their specific models.
This page lists the frameworks worth knowing, groups them by what they are good at, and gives you a short, neutral take on each. All links go to the official project site.
The 2026 landscape at a glance
Rather than hunting for a single “best” framework, it helps to think in five overlapping layers:
- Agent frameworks — run the model-tool-model loop, handle planning and multi-agent coordination.
- RAG & data frameworks — ingest documents, chunk them, embed them, and retrieve relevant context at query time.
- Orchestration & general — chains, graphs, routers and the glue that connects prompts, tools and data sources.
- Typed output / structured generation — force the model to return valid JSON, Pydantic objects or schema-constrained strings.
- Lightweight / single-purpose — one-file libraries that do one thing well, such as provider routing or validation.
Most production stacks mix several layers. A common 2026 shape: LiteLLM for provider routing, LlamaIndex or Haystack for retrieval, Instructor or BAML for structured output, and LangGraph or a vendor Agent SDK for the control loop.
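To make that shape concrete, here is a minimal sketch of two of those layers working together: LiteLLM for routing and Instructor for typed output. It assumes Instructor’s LiteLLM adapter (instructor.from_litellm); the model name and schema are illustrative.

```python
# Sketch: LiteLLM handles provider routing, Instructor enforces a schema.
# Assumes Instructor's litellm adapter; model and schema are illustrative.
import instructor
import litellm
from pydantic import BaseModel

class Ticket(BaseModel):
    summary: str
    priority: str

# Wrap litellm's completion function so every call returns a typed object.
client = instructor.from_litellm(litellm.completion)

ticket = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Ticket,
    messages=[{"role": "user", "content": "Triage this bug report: ..."}],
)
print(ticket.priority)
```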
Comparison table — every framework at a glance
| Framework | Language | Type | Maturity | Licence | Best for |
|---|---|---|---|---|---|
| LangChain | Python / TS | Orchestration | Mature | MIT | General-purpose chains, widest integration surface |
| LangGraph | Python / TS | Agent / graph | Mature | MIT | Stateful, branching agent workflows |
| LlamaIndex | Python / TS | RAG | Mature | MIT | Ingesting, indexing and querying your own data |
| Haystack | Python | RAG | Mature | Apache 2.0 | Production search and QA pipelines |
| RAGFlow | Python | RAG | Growing | Apache 2.0 | Deep document understanding with OCR and layout |
| Verba | Python | RAG | Growing | BSD-3 | Drop-in RAG chatbot on top of Weaviate |
| Semantic Kernel | .NET / Py / Java | Orchestration | Mature | MIT | Enterprise .NET and Microsoft-stack apps |
| Vercel AI SDK | TypeScript | Orchestration | Mature | Apache 2.0 | TypeScript web apps, streaming UIs, edge runtimes |
| AutoGen | Python / .NET | Agents | Mature | CC-BY-4.0 / MIT | Multi-agent conversation, research-grade experiments |
| CrewAI | Python | Agents | Mature | MIT | Role-based multi-agent teams with ergonomic APIs |
| OpenAI Agents SDK | Python / JS | Agents | Stable | MIT | Agents against GPT models with handoffs and guardrails |
| Claude Agent SDK | Python / TS | Agents | Stable | MIT | Agents against Claude with tool use and subagents |
| Mastra | TypeScript | Agents | Growing | Apache / Elastic | End-to-end TS agent apps with workflows and evals |
| PydanticAI | Python | Agents | Growing | MIT | Typed Python agents with Pydantic at the core |
| smolagents | Python | Agents | Growing | Apache 2.0 | Minimal code-writing agents, Hugging Face ecosystem |
| Instructor | Python / TS / Go | Typed output | Mature | MIT | Pydantic-typed outputs from any major provider |
| Outlines | Python | Typed output | Mature | Apache 2.0 | Regex / grammar-constrained generation, local models |
| Guidance | Python | Typed output | Mature | MIT | Interleaving generation and control flow |
| BAML | DSL + Py/TS | Typed output | Growing | Apache 2.0 | Schema-first prompts with a dedicated DSL |
| LiteLLM | Python | Routing | Mature | MIT | One API for 100+ providers, cost and fallback control |
| Guardrails | Python / JS | Validation | Mature | Apache 2.0 | Input / output validation and safety policies |
| DSPy | Python | Programming | Mature | MIT | Declarative prompt programs you can optimise |
Agent frameworks
Agent frameworks run the loop: the model picks a tool, the tool returns a result, the model reasons about the result, and the loop continues until a task is finished. They differ in how much structure they impose, how multi-agent coordination works, and which models they are tuned for.
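Stripped to its essentials, that loop looks something like the sketch below, written against the raw OpenAI SDK. The tool, model name and prompt are illustrative, and retries, step limits and error handling are omitted.

```python
# Sketch: the hand-rolled model-tool-model loop that agent frameworks automate.
import json
from openai import OpenAI

client = OpenAI()

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real API call

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Oslo?"}]
while True:
    reply = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, tools=tools
    ).choices[0].message
    if not reply.tool_calls:       # no tool requested: this is the final answer
        print(reply.content)
        break
    messages.append(reply)         # keep the assistant turn in context
    for call in reply.tool_calls:  # run each requested tool...
        args = json.loads(call.function.arguments)
        messages.append({          # ...and feed the result back to the model
            "role": "tool",
            "tool_call_id": call.id,
            "content": get_weather(**args),
        })
```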
LangChain’s AgentExecutor, one long-standing packaged version of exactly this loop, is still widely deployed and well documented; it pairs with LangSmith for tracing and evaluations.
RAG & data frameworks
RAG (retrieval-augmented generation) frameworks focus on getting the right context into the prompt. They handle document ingestion, chunking, embedding, vector storage, hybrid search, re-ranking and query pipelines. If your product is “chat with your documents”, you live in this layer.
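The canonical “chat with your documents” flow can be sketched in a few lines with LlamaIndex; this assumes an OPENAI_API_KEY in the environment and documents in a local data directory.

```python
# Minimal RAG sketch with LlamaIndex: ingest, index, query.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # ingest and parse files
index = VectorStoreIndex.from_documents(documents)     # chunk, embed, store
query_engine = index.as_query_engine()                 # retrieval + answer synthesis
print(query_engine.query("What does the contract say about termination?"))
```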
Orchestration & general-purpose
Orchestration frameworks are the generalists. They give you chains, graphs, memory, tools, callbacks and integrations, so you can wire together prompts, retrievers, APIs and model calls into arbitrary flows. They tend to be the broadest and also the most opinionated about how an LLM app should be structured.
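The simplest form of that glue is a chain. With LangChain’s expression language, for instance, a prompt, a model and a parser compose into a single invocable pipeline (a sketch; the model name is a placeholder):

```python
# Sketch: a prompt, model and output parser piped into one chain.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

chain = (
    ChatPromptTemplate.from_template("Translate to French: {text}")
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)
print(chain.invoke({"text": "good morning"}))
```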
Typed output & structured generation
A large share of production bugs in LLM apps are shape problems: the model returns nearly-valid JSON that your parser trips over. This group of frameworks exists to force the model to produce output that conforms to a schema, either at the token level or by validating and retrying.
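Here is roughly what the validate-and-retry approach looks like with Instructor: the Pydantic model is the contract, and a failed validation triggers a re-ask (a sketch; the model name and schema are placeholders).

```python
# Sketch: schema-enforced extraction with Instructor's retry-on-failure.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Invoice(BaseModel):
    vendor: str
    total_eur: float

client = instructor.from_openai(OpenAI())
invoice = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Invoice,  # output must parse and validate as Invoice
    max_retries=2,           # on validation failure, re-ask with the error
    messages=[{"role": "user", "content": "Extract: ACME GmbH invoice, 1240.50 EUR total"}],
)
print(invoice.total_eur)
```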
BAML takes the schema-first route instead: you define types and prompt functions in .baml files, and the toolchain generates typed Python or TypeScript clients with deterministic parsing, backed by strong VS Code tooling.
Lightweight / single-purpose
The last group is small libraries that do one thing and do it well. They intentionally do not try to be a framework; they slot into whatever stack you have.
Decision matrix — use X if you need Y
Chat with your documents
Start with LlamaIndex or Haystack. Add RAGFlow if your corpus is heavy on scanned PDFs, tables or layout-sensitive documents.
Build a TypeScript web app
Pick Vercel AI SDK for the client-facing layer and Mastra if you need agents, workflows and evals in the same codebase.
Multi-step autonomous agents
Use LangGraph for fine-grained control over state, AutoGen for conversation-style multi-agent research, or CrewAI for role-based teams with ergonomic APIs.
Stay close to the model vendor
Use OpenAI Agents SDK for GPT-family models or Claude Agent SDK for Claude. Both ship official primitives for tools, handoffs and guardrails.
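Both SDKs keep the surface small. The OpenAI Agents SDK quickstart shape, for instance, is roughly this (a sketch; the agent name and prompt are illustrative):

```python
# Sketch: a minimal agent with the OpenAI Agents SDK.
from agents import Agent, Runner

agent = Agent(name="Assistant", instructions="You are a helpful assistant.")
result = Runner.run_sync(agent, "Summarise the main risks in this plan: ...")
print(result.final_output)
```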
Strict typed JSON out
Instructor for cloud models with Pydantic schemas; Outlines for regex or grammar-constrained generation on local models; BAML if you want a schema-first DSL.
One provider API, many models
Put LiteLLM in front of everything. It abstracts over 100+ providers behind the OpenAI API shape and adds retries, budgets and routing.
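In practice that means one call shape for every backend, with only the model string changing (a sketch; model names are placeholders):

```python
# Sketch: the same OpenAI-shaped call routed to two different providers.
from litellm import completion

for model in ("gpt-4o-mini", "anthropic/claude-sonnet-4-5"):
    resp = completion(model=model, messages=[{"role": "user", "content": "hi"}])
    print(model, resp.choices[0].message.content)
```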
Enterprise .NET stack
Semantic Kernel is the clearest fit, especially alongside Azure OpenAI and existing Microsoft identity, logging and deployment plumbing.
Small, minimal agents
smolagents for code-writing agents; PydanticAI if you want the typed-Python feel of FastAPI for LLM apps.
Tighten safety and validation
Layer Guardrails on top of whatever agent or chain you use — it is framework-agnostic and can gate both inputs and outputs.
Optimise prompts systematically
Reach for DSPy. It is the clearest answer to “treat prompts as code that can be compiled and optimised against a metric”.
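In DSPy you declare a signature rather than writing a prompt; the framework compiles the prompt and can later optimise it against a metric (a sketch; the model name is a placeholder and the API reflects recent DSPy versions):

```python
# Sketch: a declarative DSPy program; the prompt itself is generated.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
qa = dspy.ChainOfThought("question -> answer")  # a signature, not a prompt string
print(qa(question="Why does chunk size matter in RAG?").answer)
```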
When not to use a framework
A framework is not free. You pay in dependencies, abstraction layers, performance overhead and the risk that a future breaking change ripples through your code. There are real cases where calling the provider SDK (or plain HTTP) is the better choice:
- Single-shot prompts or thin wrappers. If your feature is “send a prompt, show the answer”, the OpenAI, Anthropic or Gemini SDK plus a few lines of code will be shorter, faster and easier to debug than any framework.
- Simple chatbots with short memory. You can manage a message list and a system prompt in a few dozen lines (see the sketch after this list); frameworks become necessary when tool use, retrieval or multi-step planning enter the picture.
- Prototyping with unusual providers or models. When you are experimenting with a new endpoint that lacks framework support, raw HTTP often wins on velocity.
- Latency-critical code paths. Heavy orchestration layers can add noticeable overhead in streaming chat; measure before committing.
- You already have strong internal libraries. Some mature teams have grown their own mini-framework that fits their codebase and domain better than anything off-the-shelf.
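For reference, the no-framework chatbot from the second bullet really is this small (a sketch with the OpenAI SDK; the model name is a placeholder):

```python
# Sketch: a frameworkless chatbot: system prompt + message list + SDK call.
from openai import OpenAI

client = OpenAI()
messages = [{"role": "system", "content": "You are a concise assistant."}]

while True:
    messages.append({"role": "user", "content": input("> ")})
    reply = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages
    ).choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print(reply)
```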
A useful heuristic: start with the vendor SDK. Move to a framework the first time you need retrieval, typed output, a tool loop, or a provider swap.
Frequently asked questions
What is an LLM framework?
An LLM framework is a library or toolkit that sits between your application code and a large language model. It handles prompt construction, tool calling, memory, structured output, retrieval, evaluation and multi-step agent loops so you do not have to write all of that plumbing against a raw HTTP API.
Do I always need one?
No. For single-shot prompts, simple chatbots or thin wrappers, calling the provider SDK directly is often faster and easier to debug. Frameworks pay off once you need agents with tools, retrieval over your own data, structured outputs, routing across providers or repeatable evaluations.
How do LangChain and LlamaIndex differ?
They overlap heavily in 2026 but have different centres of gravity. LangChain and LangGraph focus on general orchestration, agents and tool use; LlamaIndex is optimised for ingesting, indexing and querying your own documents (RAG). Many teams use both in the same stack.
Which agent framework is best?
There is no single winner. OpenAI Agents SDK and Claude Agent SDK are the most tightly integrated with their respective models; LangGraph and AutoGen target complex multi-step or multi-agent flows; CrewAI and Mastra prioritise developer ergonomics; smolagents and PydanticAI keep the surface area small.
Are these frameworks free?
Most are open source under permissive licences (MIT, Apache 2.0) and free to self-host. Some vendors offer paid hosted platforms on top — for example LangSmith for LangChain, LlamaCloud for LlamaIndex, or Mastra Cloud — but the core framework is typically free.
Can I mix several frameworks in one stack?
Yes, and it is common. A typical 2026 stack might use LiteLLM for provider routing, Instructor or BAML for structured output, LlamaIndex for retrieval, and LangGraph or an Agent SDK for the control loop. Framework-agnostic pieces like Guardrails and LiteLLM are specifically designed to slot in alongside others.