
The Management Review Control Problem in Agentic SOX Testing.

TL;DR

[SEC Interpretive Release 33-8810](https://www.sec.gov/rules/interp/2007/33-8810.pdf) preserves the evaluator's judgment for a human and treats it as non-delegable. Agentic AI can prepare populations, apply rules, and draft exception write-ups, but the human review of the agent's output now becomes the operative Management Review Control under [PCAOB AS 2201](https://pcaobus.org/oversight/standards/auditing-standards/details/AS2201). Vendors that blur this line ship inspection findings, not automation.

If an AI agent pulls the population, runs the test, drafts the exception write-up, and produces a workpaper, who is the preparer of record, and what does the reviewer actually review?

Where the reviewer's signature ultimately lands.

This is the question every SOX PMO will face in the next two years, and most agentic AI vendors are not answering it. The reason it matters is concrete. Management review controls have been among the most frequently cited deficiencies in PCAOB inspection reports for nearly a decade. Adding an agent to the workflow without resolving the question deepens the problem.

This piece is for issuer-side teams who are evaluating where agentic AI fits inside a SOX 404(a) program. The standards we anchor to are SEC Interpretive Release 33-8810 and COSO 2013. Both preserve evaluator judgment as central to the assessment, and that is exactly where agents stop being useful.

What 33-8810 actually requires

When the SEC issued Interpretive Release 33-8810 in June 2007, it gave management a framework for assessing internal control over financial reporting under SOX 404(a). Three things in that framework matter for agentic AI.

First, the assessment is risk-based. Management evaluates controls in proportion to the risk that a failure would result in a material misstatement.

Second, evidence requirements scale with risk. Higher-risk areas require more persuasive evidence. Lower-risk areas can rely on lighter procedures.

Third, and most relevant here, the evaluator's judgment is not delegable. 33-8810 explicitly preserves judgment for the person responsible for the assessment. Whether a control deficiency exists, whether it is significant, whether it rises to a material weakness, all of these are judgments rooted in the likelihood and magnitude of potential misstatement. They depend on facts the agent does not have and considerations the agent cannot weigh.

This is not a limitation of current model capability. It is a structural feature of how the SEC chose to design management's assessment. An agent that produces a control conclusion is supplying an input to the evaluator's judgment. It is not supplying the judgment.

Why management review controls have been a chronic inspection finding

The MRC problem predates AI by a decade. PCAOB inspection reports across the Big 4 have flagged the same pattern repeatedly under the precision and evidence-of-review expectations in AS 2201.42-.45. A reviewer signs off on a report, a reconciliation, or an analysis prepared by someone else, and inspectors cannot find evidence the reviewer actually examined the right things at the right level of precision.

The recurring deficiencies fall into four buckets.

  1. Precision. The reviewer did not examine the analysis at a level of detail sufficient to detect a material misstatement. They reviewed totals when they should have reviewed line items.
  2. Evidence of review. The signoff exists, but there is no documentation of what the reviewer looked at, what they questioned, or what they followed up on.
  3. Threshold and criteria. The reviewer did not document the threshold for "what counts as anomalous" or the criteria they applied to identify items requiring additional procedures.
  4. Source data reliability. The reviewer accepted the underlying data without evidence that the population was complete or the inputs were reliable. This is the AS 1105 relevance-and-reliability test applied at the management-review layer.

Every one of these gets harder when an agent is in the loop. If an agent prepared the underlying analysis, the reviewer's burden of skepticism increases. The reviewer is no longer reviewing a junior analyst's work. They are reviewing a system whose error modes they may not understand.

The reviewer is the operative control. The standard does not move.

What agents can do, and where they stop

The honest answer is that agents extend the preparer's reach. They are very good at preparation work, including some work that has historically required senior preparers.

Where agents perform well today.

  • Population assembly and reconciliation. Pulling source data from upstream systems, reconciling counts, identifying gaps, and documenting the source query.
  • Rule-based exception identification. Applying defined rules to a population and flagging items that violate them. SLA breaches, missing approvals, out-of-policy access, and similar deterministic checks.
  • Procedural documentation. Drafting a description of the procedures performed, including timestamps, source systems, queries run, and evidence captured.
  • First-pass classification. Sorting items into "appears to comply," "needs human review," and "appears to violate the rule."

These are real capabilities and they free up significant preparer hours.
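To make the deterministic part concrete, here is a minimal sketch of rule-based exception identification feeding a first-pass classification. The record fields, the SLA threshold, and the rules themselves are illustrative assumptions, not a real test plan; the point is that every check is deterministic and every classification carries its reasons.

```python
# Minimal sketch: deterministic rule checks plus first-pass classification.
# Field names, the SLA threshold, and the rules are hypothetical examples.
from dataclasses import dataclass, field


@dataclass
class AccessGrant:
    grant_id: str
    approver: str | None               # None means no approval evidence was found
    approval_age_days: int | None      # days between request and approval, if known
    source_system: str


@dataclass
class Finding:
    grant_id: str
    classification: str                # "appears_compliant" | "needs_human_review" | "appears_violation"
    reasons: list[str] = field(default_factory=list)


SLA_DAYS = 5  # hypothetical policy threshold


def classify(grant: AccessGrant) -> Finding:
    reasons: list[str] = []
    if grant.approver is None:
        reasons.append("missing approval evidence")
    if grant.approval_age_days is None:
        # Evidence gap: the agent cannot conclude either way, so it routes
        # the item to a human rather than guessing.
        reasons.append("approval timestamp not found in source system")
        return Finding(grant.grant_id, "needs_human_review", reasons)
    if grant.approval_age_days > SLA_DAYS:
        reasons.append(f"approval exceeded the {SLA_DAYS}-day SLA")
    status = "appears_violation" if reasons else "appears_compliant"
    return Finding(grant.grant_id, status, reasons)


population = [
    AccessGrant("AG-001", "j.doe", 2, "iam_prod"),
    AccessGrant("AG-002", None, 9, "iam_prod"),
    AccessGrant("AG-003", "a.lee", None, "iam_prod"),
]
for finding in (classify(g) for g in population):
    print(finding.grant_id, finding.classification, finding.reasons)
```

Note where the sketch stops: it labels items, it does not conclude on severity. Whether any flagged item is a deficiency remains the evaluator's call, which is where the next list picks up.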

Where agents stop being useful.

  • Severity evaluation. Whether a deviation is a control deficiency, a significant deficiency, or a material weakness depends on likelihood and magnitude of potential misstatement, the framework in AS 2201.A3 through A7. The agent does not have the financial reporting context, the entity-level mitigation, or the qualitative considerations to make this call.
  • Scope and risk judgment. Whether the right population was tested, whether the control as designed addresses the relevant risk, and whether the level of precision matches the risk are evaluator decisions.
  • Compensating control analysis. When a control fails, deciding which other controls might mitigate the residual risk requires understanding the control environment as a whole.
  • Conclusion on operating effectiveness. Aggregating individual test results into a conclusion about whether the control operated effectively over the period requires weighing the pattern of exceptions, root cause, and management response.

These are the things 33-8810 reserves for the evaluator. An agent that crosses this line is no longer extending the preparer's reach. It is supplying judgment the standard requires from a human.

The MRC paradox with agents in the loop

Same reviewer, heavier review. The MRC moves with the work, not with the worker.

Here is the paradox vendors are not addressing. When an agent prepares the analysis, the human review of the agent's work becomes the operative management review control. The reviewer is the person whose judgment is being relied upon for the assessment, and the reviewer's review of the agent's output is now the control that needs to survive scrutiny.

This means the standards that get applied to MRCs (precision, evidence of review, threshold and criteria, source data reliability) now apply to the human review of the agent. Three implications follow.

The reviewer needs evidence of what the agent did, at the level of precision required by the underlying risk. "The agent produced this report" is not enough. The reviewer needs the agent's evidence chain, the rules applied, the items flagged, and the items the agent classified as compliant but the reviewer should reconsider.

The reviewer needs to document what they examined, challenged, and resolved. A signoff on the agent's output without evidence of substantive review is the same rubber-stamp problem PCAOB has cited for a decade, with an agent prepping the work instead of a junior analyst.

The reviewer needs to understand the agent's failure modes well enough to design the review around them. If the agent reliably misses a particular class of exception, the human review needs to be designed to catch that class. This is a competency problem the reviewer cannot delegate to the vendor.
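To make the first and third implications concrete, here is a sketch, with hypothetical field names, of the hand-off package a reviewer would need from an agent-prepared test. It is not a standard or any vendor's API; it only shows the categories of information that have to be in one place before the review can be substantive.

```python
# Minimal sketch of an agent-to-reviewer hand-off package. Every field name
# here is hypothetical; the structure only illustrates what a substantive
# review needs to see beyond the agent's conclusion.
from dataclasses import dataclass


@dataclass
class EvidenceRef:
    source_system: str      # e.g. "iam_prod"
    query: str              # the exact query the agent ran
    pulled_at: str          # ISO-8601 timestamp of the extract
    record_count: int       # supports a completeness check on the population


@dataclass
class AgentHandoff:
    control_id: str
    evidence: list[EvidenceRef]    # where every input came from
    rules_applied: list[str]       # human-readable rules the reviewer can challenge
    flagged_items: list[str]       # item IDs the agent marked as exceptions
    compliant_items: list[str]     # items the agent passed, still open to reviewer re-test
    known_blind_spots: list[str]   # failure modes the vendor has disclosed for this test type
```

The last field is the one most hand-offs omit, and it is the one the third implication depends on.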

An agent extends the preparer's reach. It does not extend the evaluator's.

A practical framework for buyers

For SOX PMOs evaluating agentic AI, the question is not "can the agent run the test?" The question is whether the agent's output is structured to support an MRC that will survive inspection. Five requirements follow; a short sketch below shows what requirements 3 and 4 can look like in practice.

  1. Evidence lineage. Every conclusion the agent reaches must be traceable to source data, with the source system, query, timestamp, and retention captured. The reviewer should answer "where did this come from" without leaving the workpaper.
  2. Rule transparency. The rules the agent applied must be documented in a form the reviewer can challenge. "The agent flagged this as an exception because rule X was violated" is the minimum standard.
  3. Reproducibility. Re-running the agent on the same evidence with the same rules should produce the same conclusion. If the conclusion drifts across runs, you do not have a control. You have a guess.
  4. Reviewer evidence capture. The system must record what the reviewer actually examined. The signoff alone is not enough. Which items did they open? What did they re-test? What did they ask the agent to redo?
  5. Failure mode disclosure. The vendor should be able to tell you where the agent reliably underperforms. If they cannot, the reviewer cannot design around it, and the MRC has a hole.

A vendor that cannot meet these five requirements is selling a productivity tool, not a control. We unpack each of these requirements (lineage, reviewer evidence, reproducibility) in Traceable AI Workpapers in SOX 404(a), with worked examples for each.
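A minimal sketch of what requirements 3 and 4 could look like in practice, assuming hypothetical function names and log fields: a fingerprint over the evidence and rules so drift across runs is detectable, and a log that captures the substance of the review rather than just the signoff.

```python
# Minimal sketch of requirements 3 and 4. Function names, log fields, and the
# example data are illustrative assumptions, not a prescribed format.
import hashlib
import json
from datetime import datetime, timezone


def run_fingerprint(evidence_rows: list[dict], rules: list[str]) -> str:
    """Same evidence plus same rules should always yield the same fingerprint.

    If two runs over identical inputs reach different conclusions, the
    fingerprint shows the inputs did not change -- the agent's behavior did.
    """
    payload = json.dumps({"evidence": evidence_rows, "rules": rules},
                         sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ReviewerLog:
    """Records what the reviewer actually examined, not just the signoff."""

    def __init__(self, workpaper_id: str, reviewer: str):
        self.workpaper_id = workpaper_id
        self.reviewer = reviewer
        self.entries: list[dict] = []

    def record(self, action: str, item_id: str, note: str = "") -> None:
        # action might be "opened", "re-tested", or "sent back to agent"
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "item_id": item_id,
            "note": note,
        })


# Usage sketch
evidence = [{"grant_id": "AG-002", "approver": None}]
rules = ["every access grant requires a documented approver"]
print(run_fingerprint(evidence, rules))

log = ReviewerLog("WP-ITGC-07", "controller@example.com")
log.record("re-tested", "AG-002", "pulled the approval ticket directly from the ITSM system")
```

Neither artifact makes the review automatic; they make it auditable, which is what the inspection record will ask for.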

What this means for SOX programs

Agentic AI is going to change SOX testing. It is already changing the preparation half of the workflow, and the productivity gains are real. The mistake is assuming the gains extend into the evaluation half. They do not, and 33-8810 is the reason.

SOX PMOs who get this right will treat agents as a force multiplier for preparers and a structured input to the evaluator's review. They will design their MRCs to assume an agent is in the loop, with evidence requirements that make the human review substantive. They will pick vendors who can show the five requirements above on day one of evaluation. The COSO 2013 Information and Communication component is where the requirements live conceptually, and 33-8810 is where the assessment obligation lives.

PMOs who get this wrong will end up with workpapers that look automated and feel fast, but cannot survive a senior auditor pulling a sample. The inspection findings will look familiar. They will be the same MRC findings PCAOB has been writing for ten years, with a different prep tool.

The boundary is the management review control. That is where automation stops, and that is where buyers should look hardest before signing anything.

See Arden run this in production

Arden is the audit intelligence engine for SOX 404(a) testing. Agents pull evidence across your stack, run every test in your plan, and produce workpapers structured for external auditor reliance under AS 2201.

Book a 30-minute walkthrough

FAQ

Does 33-8810 prohibit using AI in SOX testing? No. 33-8810 is technology-neutral. It permits any approach to gathering evidence and performing procedures, as long as the evaluator's judgment on design and operating effectiveness is preserved. AI is a tool inside the framework. The framework still applies.

If the agent prepares the workpaper, who signs as preparer? The human who directed the agent and reviewed its output before submitting it for review. The agent is not a person of record under any current standard. Signoff still requires a human preparer who can be held accountable for the work.

Does an agent-prepared workpaper require a more rigorous review? In most cases, yes. The reviewer is now the operative MRC and inherits the precision and evidence-of-review obligations that have been the recurring inspection finding for years. Plan for a heavier review.

What is the difference between a deviation and a control deficiency? A deviation is an instance where the control did not operate as designed. A control deficiency is the evaluator's conclusion that the control as designed or operated does not adequately prevent or detect a misstatement on a timely basis. Agents can identify deviations. Only the evaluator can conclude on deficiencies.

Can an agent ever supply the evaluator's judgment? Not under 33-8810 as written. The evaluator is the person responsible for the assessment, and that assignment of responsibility is what makes the conclusion management's. The tool cannot inherit it. An agent can supply structured input, documentation, and analysis. It cannot supply the conclusion.
