Business Automation · 8 min read · 20 January 2026

How to Run an AI Automation Pilot in 8 Weeks

A well-designed pilot answers the question the organisation actually needs answered: will this automation work reliably enough, at sufficient quality, to justify full deployment?

Ajay Prajapat

AI Systems Architect

The difference between a successful AI automation pilot and an unsuccessful one is not usually the technology. It is the design. A poorly designed pilot answers the wrong question ("can we build this?") and leaves the decision-makers without the data they need to make the go/no-go call. A well-designed pilot answers the right question: does this automation work reliably enough, at sufficient quality, to justify full deployment at the expected cost?

Designing the Pilot Before Writing Any Code

The most important pilot decisions are made before a line of code is written. Answer these questions first: What specific hypothesis is being tested? How many real transactions will the pilot cover? Which metrics will determine success? What is the go/no-go threshold for each metric? Who makes the go/no-go decision, and when?

Hypothesis: "AI can process documents of type X with ≥92% field accuracy and a ≥75% straight-through rate" (the share of transactions completed with no human intervention)
Volume: enough transactions to be statistically meaningful — typically 200-1,000 depending on variance
Duration: long enough to capture weekly/monthly variation patterns — typically 4-6 weeks of live processing
Success criteria: defined quantitatively before the pilot starts, not after you see the results
Evaluation methodology: who reviews accuracy? How are edge cases adjudicated? How is ground truth established?
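The "statistically meaningful" volume can be estimated before the pilot starts. Below is a minimal sketch. It assumes accuracy is treated as a simple proportion and uses the standard normal-approximation sample-size formula. The function name `required_sample_size` is illustrative and does not come from any library.

```python
import math


def required_sample_size(expected_rate: float, margin: float, z: float = 1.96) -> int:
    """Observations needed to estimate a rate to within +/- margin.

    Normal approximation to the binomial: n = z^2 * p * (1 - p) / margin^2.
    z = 1.96 corresponds to roughly 95% confidence.
    """
    if not 0 < expected_rate < 1:
        raise ValueError("expected_rate must be strictly between 0 and 1")
    if margin <= 0:
        raise ValueError("margin must be positive")
    n = (z ** 2) * expected_rate * (1 - expected_rate) / margin ** 2
    return math.ceil(n)


# Field accuracy expected near the 92% threshold, estimated to within +/-2 points
print(required_sample_size(0.92, 0.02))  # 707
# Straight-through rate expected near 75%, estimated to within +/-5 points
print(required_sample_size(0.75, 0.05))  # 289
```

Rates closer to 50% need more observations than rates near 0% or 100%, which is one reason the typical range is wide. Widening the margin on field accuracy to ±3.5 points drops the first figure to 231. Field accuracy is counted per field, so each document contributes several observations. Errors within one document are often correlated, so treat these figures as a lower bound.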

The 8-Week Pilot Timeline

Weeks 1-2: Baseline and build

Measure the current-state process precisely: transaction volumes, handling time per transaction and error rates. Build the minimum viable automation, meaning just enough to process real transactions and produce measurable output. Do not optimise yet. The goal is a working system that produces results you can evaluate, not a polished product.
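One way to pin down the baseline is a small summary over records of the manual process. The sketch below assumes a hypothetical per-transaction record containing the manual handling time and whether an error was later found. The `ManualRecord` schema is an illustration, not a prescribed format.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ManualRecord:
    """One transaction handled by the current manual process (hypothetical schema)."""
    handling_minutes: float
    had_error: bool  # e.g. an error caught later in QA or downstream rework


def baseline_summary(records: list[ManualRecord]) -> dict[str, float]:
    """Summarise the current-state process so pilot results have a fixed comparison point."""
    if not records:
        raise ValueError("no baseline records supplied")
    return {
        "volume": len(records),
        "median_handling_minutes": median(r.handling_minutes for r in records),
        "error_rate": sum(r.had_error for r in records) / len(records),
    }


records = [ManualRecord(12.0, False), ManualRecord(8.0, True), ManualRecord(10.0, False)]
print(baseline_summary(records))
```

The median is used instead of the mean so that a few unusually slow transactions do not dominate the baseline handling time.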

Weeks 3-6: Live processing and learning

Run the automation on real transactions in parallel with the manual process (not replacing it). Every automated output is checked against the manual process result. Log everything: inputs, outputs, confidence scores, accuracy, processing time. Review the results weekly. Fix clear bugs immediately. Do not over-tune — you want to understand the baseline capability, not the best-case capability.
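A parallel run produces one comparison per field per transaction. The sketch below shows one way to log these comparisons as JSON lines, treating the manual result as ground truth. `compare_and_log` and its row layout are hypothetical. The case-and-whitespace comparison is a placeholder for whatever matching rules your reviewers agree on.

```python
import io
import json
import time
from typing import TextIO


def compare_and_log(
    transaction_id: str,
    document_type: str,
    automated: dict[str, tuple[str, float]],  # field -> (value, confidence)
    manual: dict[str, str],                    # field -> value from the manual process
    processing_seconds: float,
    log: TextIO,
) -> list[dict]:
    """Compare automated output with the manual result; write one JSON line per field."""
    rows = []
    for field, manual_value in manual.items():
        # A field the automation did not return counts as empty with zero confidence.
        auto_value, confidence = automated.get(field, ("", 0.0))
        rows.append({
            "transaction_id": transaction_id,
            "document_type": document_type,
            "field": field,
            "automated": auto_value,
            "manual": manual_value,
            "confidence": confidence,
            # Placeholder rule: ignore case and surrounding whitespace. Real pilots
            # usually need field-specific rules (dates, amounts) agreed up front.
            "correct": auto_value.strip().casefold() == manual_value.strip().casefold(),
            "processing_seconds": processing_seconds,
            "logged_at": time.time(),
        })
    for row in rows:
        log.write(json.dumps(row) + "\n")
    return rows


buffer = io.StringIO()  # stands in for an append-mode log file
rows = compare_and_log(
    "T-0001", "invoice",
    automated={"total": ("120.00", 0.97), "supplier": ("Acme Ltd", 0.64)},
    manual={"total": "120.00", "supplier": "ACME Ltd.", "date": "2026-01-05"},
    processing_seconds=2.4,
    log=buffer,
)
print([(r["field"], r["correct"]) for r in rows])
# [('total', True), ('supplier', False), ('date', False)]
```

An append-only JSON-lines log stores the raw data for the week-7 analysis, so the results can be regrouped in new ways without reprocessing any transactions.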

Week 7: Analysis and failure mode review

Aggregate all pilot data. Calculate accuracy by document type, by field, by confidence band. Identify the systematic failure modes — the patterns of errors that repeat. Categorise: fixable before deployment (prompt tuning, validation rule addition, training data addition) vs structural limitations (the automation cannot reliably handle this document type).
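When the per-field results have been logged, the week-7 breakdowns reduce to simple group-bys. Below is a minimal sketch. It assumes each logged row carries `document_type`, `field`, `confidence` and a boolean `correct`, which are hypothetical names. The three confidence bands are an arbitrary choice.

```python
from collections import Counter, defaultdict
from typing import Callable

# Band edges are an illustration; choose bands that match your review thresholds.
BANDS = ((0.90, "0.90+"), (0.80, "0.80-0.89"), (0.0, "<0.80"))


def confidence_band(confidence: float) -> str:
    """Map a confidence score to the first band whose lower edge it reaches."""
    for lower, label in BANDS:
        if confidence >= lower:
            return label
    return BANDS[-1][1]  # scores below zero, if a model ever emits them


def accuracy_by(rows: list[dict], key: Callable[[dict], str]) -> dict[str, tuple[int, int, float]]:
    """Group logged rows by key and return (correct, total, accuracy) per group."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for row in rows:
        counts = totals[key(row)]
        counts[0] += int(row["correct"])
        counts[1] += 1
    return {group: (c, n, c / n) for group, (c, n) in totals.items()}


def top_failure_patterns(rows: list[dict], n: int = 5) -> list[tuple[tuple[str, str], int]]:
    """Most frequent (document_type, field) combinations among incorrect outputs."""
    errors = Counter((r["document_type"], r["field"]) for r in rows if not r["correct"])
    return errors.most_common(n)


rows = [
    {"document_type": "invoice", "field": "total", "confidence": 0.97, "correct": True},
    {"document_type": "invoice", "field": "supplier", "confidence": 0.64, "correct": False},
    {"document_type": "credit_note", "field": "total", "confidence": 0.91, "correct": True},
    {"document_type": "credit_note", "field": "supplier", "confidence": 0.85, "correct": False},
    {"document_type": "credit_note", "field": "supplier", "confidence": 0.72, "correct": False},
]
by_field = accuracy_by(rows, lambda r: r["field"])
by_band = accuracy_by(rows, lambda r: confidence_band(r["confidence"]))
print(by_field)
print(by_band)
print(top_failure_patterns(rows))
```

Counting repeated errors by document type and field is only a starting point for spotting systematic failure modes. Deciding whether each one can be fixed or is a structural limitation still requires a human to review the underlying examples.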

Week 8: Go/no-go recommendation

Present the pilot results against the pre-defined success criteria. For each metric: did it meet the threshold? By how much? What would it take to close the gap for metrics that missed? Recommend: go to full deployment, go with scope adjustment, or no-go with documented rationale. The recommendation should be data-driven, not optimism-driven.
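Because the thresholds were fixed before the pilot started, the week-8 comparison is mechanical. Below is a sketch. It assumes every criterion is a minimum value. Metrics where lower is better, such as processing time, would need the comparison reversed. `Criterion` and `evaluate` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """A success criterion fixed before the pilot starts (higher is better)."""
    metric: str
    threshold: float


# Thresholds from the example hypothesis, recorded before week 1
CRITERIA = (
    Criterion("field_accuracy", 0.92),
    Criterion("straight_through_rate", 0.75),
)


def evaluate(
    results: dict[str, float],
    criteria: tuple[Criterion, ...] = CRITERIA,
) -> list[tuple[str, float, float, bool]]:
    """Return (metric, measured, gap, met) for each criterion; gap = measured - threshold."""
    report = []
    for c in criteria:
        measured = results[c.metric]  # a missing metric raises KeyError instead of passing silently
        report.append((c.metric, measured, measured - c.threshold, measured >= c.threshold))
    return report


for metric, measured, gap, met in evaluate({"field_accuracy": 0.941, "straight_through_rate": 0.71}):
    print(f"{metric}: {measured:.1%} ({gap * 100:+.1f} pts vs threshold) -> {'met' if met else 'missed'}")
```

The code reports only which thresholds were met and by how much. For metrics that missed, the choice between going ahead with a scope adjustment and a no-go remains a judgment for the decision-maker named at the start.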

What a Pilot Must Not Be

A demo: a pilot uses real transactions, not hand-picked examples — cherry-picked inputs answer the wrong question
A prototype evaluation: the pilot tests production behaviour under real conditions, not prototype quality under ideal conditions
Open-ended: without defined success criteria and a specific end date, pilots drift and lose stakeholder confidence
Retrospectively evaluated: success criteria defined after seeing results are not success criteria


Key Takeaways

Define success criteria quantitatively before the pilot starts — not after seeing the results
Use real transactions, not hand-picked examples — a pilot that cherry-picks inputs answers the wrong question
Run in parallel with the manual process for 4-6 weeks; measure accuracy of every automated output against the manual result
Review results weekly during the pilot; fix bugs immediately but do not over-tune for best-case performance
Categorise failure modes: fixable before deployment vs structural limitations
Present a data-driven go/no-go recommendation against pre-defined thresholds; optimism is not a methodology
