TL;DR
External auditors deciding whether to rely on AI-generated workpapers under PCAOB AS 2201 examine three pillars: evidence lineage (where the data came from), reviewer evidence (what the human actually examined), and reproducibility (whether the procedure will produce the same answer again). AS 1105 governs evidence relevance and reliability, and a workpaper missing any pillar gets redone at the auditor's billing rate.
The first time a Big 4 senior pulls an AI-generated workpaper into a reliance review, three questions decide whether the work counts as evidence or has to be redone.
Where did the data come from. What did the human reviewer actually examine. Will running the procedure again produce the same answer.
If the answers are clean, the external auditor can rely on management's testing under AS 2201, and the work stays in the file. If any of them are unclear, the work gets repeated by the external audit team at their billing rate, and management's investment in automation produces nothing for the audit cycle.
This piece is for SOX 404(a) teams (issuer-side SOX PMOs, controllers, internal audit) preparing for the first cycle in which AI generates meaningful pieces of management's testing. The standards we anchor to are SEC Interpretive Release 33-8810 and COSO 2013. The external auditor's perspective comes in at the end because that is where the workpaper's value is decided.
Why traceability is now the binding constraint
Pre-AI, audit teams took workpaper traceability for granted. A junior analyst pulled a population, wrote a memo, and the chain of custody was implicit because a person remembered doing the work. Reviewers asked clarifying questions and got clarifying answers from the human who did it. The PCAOB AS 1215 documentation requirements were satisfied through a combination of artifacts and human memory.
That model breaks when an agent produces the workpaper. The agent does not remember in the way a person does. If the workpaper does not capture the chain of custody explicitly, no one in the room can reconstruct it later. The implicit knowledge that used to make audit work reviewable is gone.
The fix is to make every step explicit. Three pillars carry the weight.
Pillar 1. Evidence lineage
Lineage is the pedigree of every fact in the workpaper. For each item the agent touched, the workpaper should record five fields.
- Source system. The system of record where the data originated. HRIS, ERP, IdP, ticketing system, change management tool, etc.
- Query or extraction method. The exact query, API call, or report definition that produced the data. Not "pulled from Workday." The query itself.
- Timestamp. When the data was extracted, with the time-period coverage clearly identified. SOX testing is period-bound, and timestamps prove the population covers the period.
- Hash or version identifier. A fingerprint that identifies the exact data set used. If the source data changes, the hash changes, and the workpaper is now keyed to a different population.
- Retention. Where the source data is stored, for how long, and under what access controls. The auditor needs to know the evidence will still be there when they look for it.
Five fields. Every workpaper has them or it does not. There is no halfway version of lineage.
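As a minimal sketch, the five fields as a structured record. The field names, and the choice of SHA-256 for the fingerprint, are illustrative assumptions, not a standard schema.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class LineageRecord:
    """One record per evidence item the agent touched. Field names are illustrative."""
    source_system: str       # system of record, e.g. HRIS, ERP, IdP
    extraction_method: str   # the exact query, API call, or report definition
    extracted_at: str        # ISO 8601 timestamp of the extraction
    period_start: str        # start of the period the population covers
    period_end: str          # end of the period the population covers
    content_hash: str        # fingerprint keying the workpaper to one exact data set
    retention_location: str  # where the source data is stored and under what access controls
    retention_expires: str   # how long the source data is retained

def fingerprint(raw_extract: bytes) -> str:
    """Hash the raw extract; if the source data changes, this changes with it."""
    return hashlib.sha256(raw_extract).hexdigest()
```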
This maps to the issuer-side framework in two places. 33-8810 directs management to evaluate the reliability of evidence as part of the assessment. COSO 2013's Information and Communication component requires that information used in controls be relevant, complete, and accurate, with controls over the information itself. Lineage is how management can show both.
It also maps to AS 1105 on the external auditor's side. The auditor evaluates the relevance and reliability of evidence used in the audit, and an AI-generated workpaper without lineage cannot be evaluated. The auditor's only option is to redo the work. That is the failure mode automation is supposed to prevent.
Pillar 2. Reviewer evidence
Signoff is not review. PCAOB inspection findings have made this point about management review controls for years, and AI-generated workpapers make the gap harder to ignore.
Reviewer evidence is documentation of what the reviewer actually examined, not just that they signed off. The minimum capture is four fields (a record sketch follows the list).
- Items opened. Which exceptions, population samples, and source records did the reviewer examine in detail.
- Items challenged. What did the reviewer push back on. Which agent classifications did they question.
- Items re-tested. Which conclusions did the reviewer ask the agent to redo, or re-perform manually.
- Resolution. How was each challenge resolved. What evidence supported the resolution.
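The same four fields as a structured record, in a minimal sketch. Names and types are assumptions for illustration, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewerEvidence:
    """Captures what the reviewer examined, not just that they signed off."""
    reviewer: str
    signed_off_at: str                                         # ISO 8601 timestamp
    items_opened: list[str] = field(default_factory=list)      # record IDs examined in detail
    items_challenged: list[str] = field(default_factory=list)  # agent classifications questioned
    items_retested: list[str] = field(default_factory=list)    # conclusions redone or manually reperformed
    resolutions: dict[str, str] = field(default_factory=dict)  # challenged item ID -> resolution and evidence reference
```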
This is the operational answer to the MRC problem covered in The Management Review Control Problem in Agentic SOX Testing. When an agent prepares the workpaper, the human review of the agent's output becomes the operative MRC. The reviewer's burden of skepticism increases, and the documentation needs to reflect that.
Reviewer evidence is also what survives an external auditor's inquiry into the precision of management's review. Under AS 2201.42 through .45, the auditor evaluates the precision of management's review controls and the competence of the people performing them. The recurring deficiency theme in PCAOB inspection reports is precisely this gap. A signoff with no evidence of substantive review fails this evaluation. Reviewer evidence captured in the workpaper passes it.
Pillar 3. Reproducibility
A workpaper that produces a different conclusion when you run the procedure again is not a control. It is a sample of what the system happened to produce that day.
Reproducibility is the property that the same evidence, the same rules, and the same procedure produce the same conclusion every time. For agent-prepared workpapers, this requires three things (a testable sketch follows the list).
- Deterministic execution. Where the agent's reasoning is non-deterministic (most LLM-based reasoning is), the workpaper must capture the inputs, rules, prompts, and outputs at the moment of execution. Re-running with the same inputs and rules should produce the same conclusion within a tight tolerance.
- Versioned rules. The rules the agent applied (control logic, exception thresholds, classification criteria) must be versioned and stored with the workpaper. If the rules change, the workpaper records which version was used.
- Immutable execution log. A record of what the agent did, in what order, with what inputs and outputs. Not a marketing trace. A log a reviewer or external auditor can replay.
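One way to make the property testable, sketched below. `run_procedure` is a hypothetical stand-in for whatever the agent executes; the check runs it twice against the same evidence snapshot and rules version and compares a canonical fingerprint of the conclusions.

```python
import hashlib
import json
from typing import Callable

def conclusion_fingerprint(conclusions: dict) -> str:
    """Canonicalize the conclusions so identical content always hashes identically."""
    canonical = json.dumps(conclusions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def reproducibility_check(
    run_procedure: Callable[[bytes, str], dict],  # hypothetical: the agent's procedure
    evidence_snapshot: bytes,                     # the extract keyed by the lineage hash
    rules_version: str,                           # the versioned rules stored with the workpaper
) -> bool:
    """Run the same procedure twice on the same inputs and flag any drift."""
    first = conclusion_fingerprint(run_procedure(evidence_snapshot, rules_version))
    second = conclusion_fingerprint(run_procedure(evidence_snapshot, rules_version))
    return first == second
```

Strict equality is the simplest form of the check; where the workpaper states a tolerance, the comparison loosens accordingly.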
Note the careful word choice. Reproducibility is not "reperformance" in the AS 2201 sense. AS 2201 reperformance refers to the external auditor re-executing the underlying control to obtain audit evidence, which is a specific procedure with specific evidentiary weight. Reproducibility here is the more basic property that the agent's procedure is deterministic enough to be reviewable in the first place.
This pillar is the one most vendors gloss over. It is also the one most likely to fail in production. If a vendor cannot show you a reproducibility test on their own system, they have not solved this.
What breaks when the external auditor reviews
Even with all three pillars in place, AI-generated workpapers will be examined harder than human-prepared ones during the external audit. Three specific places where reliance can break.
SOC 1 on the AI tool itself. When management uses a service to prepare workpapers the external auditor will rely on, that service is a service organization. The external auditor evaluates whether to rely on a SOC 1 Type 2 report, issued under SSAE 18, on the service organization's controls. If the AI tool does not have a SOC 1 Type 2, the external auditor either has to perform additional procedures over the tool's controls or cannot rely on the management testing produced through it. This is a procurement question, and it is the subject of SOC 1 for AI Tools in the SOX Workflow.
Complementary user entity controls. A SOC 1 report identifies controls the service organization expects the user entity (the issuer using the AI tool) to implement. If management does not implement the CUECs, SOC 1 reliance breaks. Common CUECs for an AI workpaper tool will include access controls over the tool, evidence approval workflows, reviewer training, and oversight of agent rule changes. Plan for them.
ITGC dependency stack. Any application control depends on the ITGCs of the underlying application. If the AI tool consumes data from upstream systems, the reliability of that data depends on the ITGCs of those systems. The external auditor will trace the dependency stack. If an upstream ITGC has a deficiency, controls and workpapers downstream are affected. The AI tool does not insulate management from this.
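A workpaper can carry its dependency stack as plain structured data, so the external auditor traces it by lookup rather than investigation. A minimal sketch; the system names and control IDs are hypothetical.

```python
# Hypothetical dependency map recorded alongside the workpaper: which upstream
# systems supplied data, and which ITGCs the reliability of that data rests on.
dependency_map = {
    "workpaper": "WP-ACCESS-REVIEW-Q4",
    "upstream_systems": [
        {"system": "HRIS", "data": "active employee roster", "itgcs": ["ITGC-ACC-01", "ITGC-CHG-03"]},
        {"system": "IdP",  "data": "entitlement export",     "itgcs": ["ITGC-ACC-02"]},
    ],
}
```

When an upstream ITGC deficiency surfaces, the affected workpapers fall out of a filter over these maps instead of a file-by-file re-read.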
Buyer questions for vendor evaluation
When evaluating an AI workpaper vendor, the questions that separate real systems from demos are concrete.
- Show a workpaper with the five lineage fields populated for every item.
- Show the reviewer evidence model. What does the system record about reviewer behavior beyond signoff.
- Run the same procedure twice on the same evidence. Show the outputs are identical or explain the variance.
- Provide your SOC 1 Type 2 report. If you do not have one, explain when you will, and how customers handle external audit reliance in the meantime.
- Document the CUECs in your SOC 1. Walk through what management has to implement for reliance to hold.
- Provide an example of an ITGC dependency map. Show how the workpaper records its dependencies on upstream system controls.
A vendor who answers these crisply has built for the external audit review. A vendor who deflects has built for the demo.
What good looks like in 2026
AI workpapers will move up the maturity curve fast. By the end of 2026, the leading tools in this category will publish lineage, reviewer evidence, and reproducibility specs the way SaaS companies publish security pages today. The buyers who internalize now what external auditors will examine will pick the tools that survive the next inspection cycle. The buyers who do not will discover the gap when their external auditor declines reliance.
Traceability is a procurement requirement. It is the difference between automation that pays for itself in audit cycles and automation that creates more rework than it eliminates.
See Arden run this in production
Arden ships every workpaper with the five lineage fields populated, a structured reviewer evidence pane, and a reproducibility log. Every conclusion traces back to source evidence with the exact field, file, and timestamp behind it.
Related notes
- The Management Review Control Problem in Agentic SOX Testing. Where 33-8810 stops and the human reviewer's MRC begins.
- SOC 1 for AI Tools in the SOX Workflow. The procurement gate that decides whether your external auditor can rely on AI-prepared work.
FAQ
Is "audit-ready" a meaningful term. Not really. PCAOB inspects firms, not software. There is no certification a tool can hold that makes it audit-ready in any standards-recognized sense. The right question is whether the tool is designed to support management's documentation expectations under 33-8810 and to survive the external auditor's reliance assessment under AS 2201. Specific. Falsifiable. Useful.
Does an AI workpaper need to be reproducible to be valid. Yes. A workpaper whose conclusion changes when re-run is not evidence of a control. It is a sample of what the system happened to produce that day. Both management and the external auditor lose the ability to rely on it.
What is a CUEC and why does it matter for AI workpapers. A complementary user entity control is a control the service organization expects the customer to implement so that the service organization's controls are effective. In an AI workpaper context, common CUECs include managing user access to the tool, approving evidence sources, training reviewers, and overseeing changes to the agent's rules. CUECs that are not implemented break SOC 1 reliance.
Can a SOC 2 substitute for a SOC 1. No. SOC 2 reports on the trust services criteria (security, availability, processing integrity, confidentiality, privacy). SOC 1 reports on controls relevant to financial reporting. An AI tool used to prepare SOX workpapers needs SOC 1 because the controls that matter are the ones over financial reporting evidence. SOC 2 is necessary for security review. It is not sufficient for SOX reliance.
If the AI tool produces a deviation, has the control failed. A deviation is an instance where the control did not operate as designed. Whether the control has failed in the sense of being a deficiency depends on the evaluator's analysis of likelihood and magnitude. The AI tool surfaces deviations. The evaluator concludes on deficiencies. The taxonomy comes from AS 2201.A3 through A7 and SEC guidance, and any vendor who blurs deviation and deficiency is a vendor to avoid.
Sources
- SEC Interpretive Release No. 33-8810, Commission Guidance Regarding Management's Report on Internal Control Over Financial Reporting (2007)
- PCAOB Auditing Standard 2201, An Audit of Internal Control Over Financial Reporting
- PCAOB Auditing Standard 1105, Audit Evidence
- COSO 2013, Internal Control – Integrated Framework
- AICPA SSAE 18, attestation standard governing SOC 1 reports