
Agents in Regulated Workflows

Patterns that survive an FCA s.166 review: the design choices that matter.

Clint Sookermany

28 April 2026

The FCA launched the Mills Review in January 2026 to examine how increasingly autonomous AI systems will reshape retail financial services by 2030. The review is asking a question that every financial services architect should already be answering: when an AI agent performs a function that looks like a regulated activity, who is accountable, and can you prove it?

This is not a theoretical concern. Agentic AI systems (those that can plan, execute multi-step tasks, and take actions with real-world consequences) are moving into production in financial services. They are routing customer queries, pre-populating advice suitability assessments, executing trades within parameters, and managing collections workflows. The design choices made now will determine whether these systems survive regulatory scrutiny or become the subject of it.

Why s.166 Is the Right Lens

A section 166 skilled person review is the FCA's sharpest investigative tool for examining how a firm's systems and controls actually operate. Unlike a thematic review, which examines an issue across the sector, a s.166 targets a specific firm and demands evidence: not what your policy says, but what your systems do in practice.

For agentic AI workflows, a s.166 review would examine three things. First, whether the firm can demonstrate that the AI system operates within its intended boundaries at all times, not just on average. Second, whether human oversight is genuine or performative. Third, whether the audit trail is complete enough to reconstruct any individual decision the agent made, including the reasoning path.

Designing for s.166 survivability is not about passing an exam. It is about building systems that are genuinely controllable, auditable, and explainable. The firms that treat regulatory compliance as a design constraint, rather than a post-hoc documentation exercise, build better systems.

Design Pattern 1: Bounded Autonomy with Hard Limits

The most common failure mode in agentic systems is scope creep: an agent designed to do one thing gradually being asked to do adjacent things, with each incremental expansion tested lightly or not at all. In the firms I've advised, this rarely happens as a single deliberate decision. It happens through a series of small accommodations. A product owner asks the agent to handle one more edge case, then another, until the system's effective scope has drifted well beyond what was originally tested and documented.

The pattern that works in regulated environments is bounded autonomy. The agent operates within a defined action space, with hard limits enforced at the infrastructure level, not just in the prompt or application logic. If an agent is authorised to send a collections letter, it cannot also offer a payment plan unless that action is separately authorised, tested, and documented.

In practice, this means:

  • Action whitelists, not blacklists. The agent can only do what is explicitly permitted. Everything else is denied by default.
  • Parameter boundaries enforced at the API layer. If the agent can adjust a credit limit, the maximum adjustment is enforced by the system, not by the model's judgement.
  • Escalation triggers that are deterministic, not probabilistic. When the agent encounters a scenario outside its boundary, it escalates. It does not attempt to reason its way through.
The firms getting this right separate the agent's reasoning capability from its execution capability. The model can think broadly. The execution layer constrains what it can do. This separation is what makes the system auditable: you can inspect the agent's reasoning and independently verify that the execution stayed within bounds. A sketch of such an execution layer follows.
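Here is a minimal Python sketch of what that separation can look like. The action names, parameter limits, and BoundaryViolation type are illustrative assumptions, not a reference implementation:

```python
from dataclasses import dataclass


class BoundaryViolation(Exception):
    """Raised when the agent proposes an action outside its authorised space."""


@dataclass(frozen=True)
class ActionRequest:
    name: str
    params: dict


# Whitelist: every permitted action, with hard parameter bounds enforced
# here at the execution layer, regardless of what the model proposed.
AUTHORISED_ACTIONS = {
    "send_collections_letter": {},
    "adjust_credit_limit": {"max_adjustment_gbp": 500},
}


def execute(request: ActionRequest) -> None:
    spec = AUTHORISED_ACTIONS.get(request.name)
    if spec is None:
        # Denied by default, escalated deterministically: the agent never
        # reasons its way past this check.
        raise BoundaryViolation(f"action not authorised: {request.name}")
    limit = spec.get("max_adjustment_gbp")
    if limit is not None and abs(request.params.get("amount_gbp", 0)) > limit:
        raise BoundaryViolation("adjustment exceeds hard limit")
    dispatch(request)


def dispatch(request: ActionRequest) -> None:
    # Hand-off to the real back-end system would happen here.
    print(f"executing {request.name} with {request.params}")


execute(ActionRequest("adjust_credit_limit", {"amount_gbp": 250}))  # permitted
try:
    execute(ActionRequest("offer_payment_plan", {"months": 6}))  # out of scope
except BoundaryViolation as exc:
    print(f"escalated to human review: {exc}")
```

The point of the design is that the bounds live in the execution layer, so a reviewer can verify them independently of anything the model reasoned about.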

Design Pattern 2: Genuine Human Oversight

The FCA has flagged human-in-the-loop protocols as a "live issue" and signalled that guidance is coming in 2026. The reason is straightforward: many firms claim human oversight of AI systems, but the oversight is nominal. A human "reviews" 200 AI-generated decisions per hour, which is not review. It is rubber-stamping.

Genuine human oversight in agentic workflows requires design choices that make oversight meaningful:

  • Sampling-based review with statistical rigour. Instead of requiring a human to approve every action (which degrades to rubber-stamping at scale), design a sampling framework where a defined percentage of actions are reviewed in depth, with the sample stratified by risk, novelty, and outcome; a sketch follows this list.
  • Anomaly-triggered escalation. The system flags actions that are statistically unusual, not just those that breach hard limits. An agent that suddenly changes its behavioural pattern, even within its authorised boundaries, should trigger human review.
  • Delay buffers for consequential actions. For actions with significant consumer impact (issuing a default notice, declining a claim, adjusting a premium), build in a delay window during which the action can be reviewed before it takes effect. This converts real-time automation into near-real-time automation with a meaningful oversight window.
The test is simple: if a regulator asked your human reviewer to explain the last 10 decisions they approved, could they do so with specificity? If the answer is no, the oversight mechanism needs redesigning.
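To make the sampling point concrete, here is a minimal sketch of risk-stratified sampling. The strata and review rates are illustrative assumptions; a real framework would derive them from the firm's risk appetite and back-test them for statistical power:

```python
import random

# Review rates per stratum. Tiers and percentages are illustrative,
# not regulatory guidance.
REVIEW_RATES = {
    "high_risk": 1.00,  # e.g. default notices: always reviewed in depth
    "novel": 0.50,      # scenarios the agent has rarely encountered
    "routine": 0.02,    # baseline statistical sample
}


def needs_human_review(stratum: str, rng: random.Random) -> bool:
    """Route a completed action to the in-depth review queue or not."""
    # Unknown strata default to review: deny-by-default applies to
    # oversight as much as to execution.
    return rng.random() < REVIEW_RATES.get(stratum, 1.0)


rng = random.Random(42)  # seeded so the sampling itself is reproducible
picked = sum(needs_human_review("routine", rng) for _ in range(10_000))
print(f"{picked} of 10,000 routine actions sampled for in-depth review")
```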

Design Pattern 3: Reconstructible Decision Trails

The audit trail for an agentic system is fundamentally different from a traditional model audit. A scoring model takes an input and produces an output. An agent takes an input, reasons about it, takes multiple actions, observes the results, adjusts its approach, and produces an outcome through a chain of decisions. Auditing that chain requires a different kind of logging.

The minimum standard for a s.166-survivable audit trail:

  • Full reasoning capture. Every step the agent considered, including options it rejected, must be logged. This is not the same as logging the final output. It requires capturing the agent's internal reasoning at each decision point.
  • State snapshots. The data the agent had access to at each step must be reconstructible. If the agent made a decision based on a customer's account balance at 14:32 on Tuesday, you need to be able to prove what that balance was.
  • Counterfactual capability. A reviewer should be able to ask: "What would the agent have done if this input had been different?" This requires the ability to replay the agent's reasoning against modified inputs, which in turn requires deterministic or near-deterministic behaviour from the underlying model.
This is expensive. Full reasoning capture increases storage costs and adds latency. State snapshots require point-in-time data architecture. Counterfactual replay requires model versioning and controlled inference. These are engineering costs that regulated firms must budget for. The alternative, deploying agents without adequate audit trails, is a cost that materialises later, in enforcement actions and remediation programmes. The sketch below shows the shape of a single logged decision step.
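As an illustration, here is a sketch of one logged decision step, assuming a JSON-lines sink. The field names are assumptions, but each maps to one of the requirements above:

```python
import hashlib
import json
import sys
import time


def log_decision_step(sink, *, run_id: str, step: int, reasoning: str,
                      options_rejected: list, action: dict,
                      state_snapshot: dict, model_version: str,
                      seed: int) -> None:
    """Append one reconstructible decision step to an append-only sink."""
    record = {
        "run_id": run_id,
        "step": step,
        "timestamp": time.time(),
        "reasoning": reasoning,                # the path, not just the output
        "options_rejected": options_rejected,  # options considered and discarded
        "action": action,
        # Snapshot plus hash lets a reviewer prove exactly what data the
        # agent saw at this moment (the 14:32 balance problem).
        "state_snapshot": state_snapshot,
        "state_hash": hashlib.sha256(
            json.dumps(state_snapshot, sort_keys=True).encode()
        ).hexdigest(),
        # Model version and seed are what make counterfactual replay
        # against modified inputs possible later.
        "model_version": model_version,
        "seed": seed,
    }
    sink.write(json.dumps(record) + "\n")


log_decision_step(sys.stdout, run_id="run-0193", step=1,
                  reasoning="balance below threshold; letter warranted",
                  options_rejected=["offer_payment_plan: not authorised"],
                  action={"name": "send_collections_letter"},
                  state_snapshot={"account_balance_gbp": -142.50},
                  model_version="collections-agent-2.3.1", seed=7)
```

Storing the model version and seed alongside the snapshot is what later makes replaying the same step against a modified input meaningful.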

The SMCR Question

The FCA's Mills Review raised a question that cuts to the heart of agentic AI governance: how does the Senior Managers and Certification Regime operate where AI systems perform functions traditionally subject to direct human oversight?

Under SMCR, a senior manager is personally accountable for the activities within their area of responsibility. When an AI agent performs those activities, the accountability does not transfer to the machine. It remains with the senior manager, which means they must be able to demonstrate that they understood what the agent was doing, that they had adequate controls in place, and that they were in a position to intervene. A senior manager I work with in retail banking described this as the "accountability gap": they are personally liable for outcomes produced by a system whose inner reasoning they cannot directly inspect. Closing that gap is a design problem, not a governance problem.

The design implication is clear: agentic systems in regulated workflows need dashboards, alerts, and reporting that are designed for the accountable senior manager, not just for the technology team. The senior manager needs to see, in near-real-time, what the agent is doing, whether it is operating within its boundaries, and what the key risk indicators look like.
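As an illustration of what such reporting might aggregate, here is a small sketch of key risk indicators computed over logged actions; the metric names and log fields are assumptions for illustration:

```python
def kri_summary(actions: list) -> dict:
    """Aggregate logged actions into indicators a senior manager can act on."""
    if not actions:
        return {"total_actions": 0}
    total = len(actions)
    return {
        "total_actions": total,
        "escalation_rate": sum(a["escalated"] for a in actions) / total,
        "boundary_breach_attempts": sum(a["breach_attempt"] for a in actions),
        "human_review_coverage": sum(a["reviewed"] for a in actions) / total,
    }


print(kri_summary([
    {"escalated": False, "breach_attempt": 0, "reviewed": True},
    {"escalated": True, "breach_attempt": 1, "reviewed": True},
    {"escalated": False, "breach_attempt": 0, "reviewed": False},
]))
```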

Building for 2026 and Beyond

The FCA has not yet published specific rules for agentic AI. The PRA and Bank of England have consistently signalled that AI will be overseen through existing frameworks rather than bespoke AI-specific regulation. This means the design patterns above are not speculative. They are applications of existing regulatory expectations (SMCR accountability, adequate systems and controls, treating customers fairly) to a new class of technology.

The firms building agentic systems today have a choice: design for the regulator you have, or redesign when the regulator tells you to. The first approach is cheaper, faster, and produces better systems.

*To discuss how the 90-Day AI Acceleration programme can help your organisation design agentic AI systems for regulated environments, contact the Value Institute.*

Clint Sookermany

Founder, The AI Value Institute by Regenvita

25 years of enterprise transformation experience across financial services, healthcare, technology, and government. Helping senior leaders turn AI ambition into measurable business value.
