TL;DR
- Champion-challenger testing validates AI models before full deployment
- Run live on a subset of accounts, not in simulation
- SR 11-7 model risk guidance expects documented challenger testing
- Results measure contact strategy, segmentation, and recovery outcomes
- Continuous challenger testing prevents model drift going undetected
A collections manager at a mid-size regional bank has been running the same contact strategy for three years. Recovery rates on early bucket accounts have drifted down four percentage points over that period. The strategy still looks reasonable in the monthly report. The segmentation logic has not changed. The contact sequences are the same ones that worked in 2022. Nobody has flagged a problem because nobody has anything to compare the current performance against. The benchmark is the strategy itself.
That is the problem champion-challenger testing solves. The champion is the current production model or contact strategy. The challenger is an alternative you want to evaluate. You run both simultaneously on live accounts, assign borrowers to each at random, measure the outcomes side by side over a defined window, and let the results determine which one runs the portfolio going forward.
For US banks operating under SR 11-7 model risk management guidance, this is not an optional practice. OCC and Federal Reserve examiners expect documented evidence of challenger testing as part of model validation. A collections AI system making recovery decisions at scale without a documented champion-challenger programme is a model risk finding waiting to happen.
What Champion-Challenger Testing Is
The champion is whatever is currently running in production: the propensity model scoring accounts, the segmentation logic routing them to treatment strategies, or the contact strategy determining channel, timing, and message. The challenger is an alternative version of any one of those components that you want to evaluate against the current approach.
Both run simultaneously on live account subsets. Accounts are assigned to champion or challenger at random, or through stratified random assignment by risk tier. The split is typically 80/20 or 70/30 in favour of the champion, which preserves portfolio stability while generating enough challenger data to produce statistically meaningful results.
The critical distinction from backtesting is that champion-challenger runs on real borrowers making real decisions in real market conditions. Backtesting applies a model to historical data. It tells you how a model would have performed in the past. Champion-challenger testing tells you how borrowers respond to your approach right now, under current economic conditions, with their current financial circumstances. Those are different questions, and only one of them is relevant to what your collections operation does next.
In AI collections, champion-challenger testing can be applied to three distinct components:
- The predictive model: the algorithm scoring each account’s propensity to pay or likelihood of response
- The contact strategy: the channel sequence, contact timing, frequency, and message framing
- The segmentation logic: how accounts are grouped and routed to different treatment strategies
Each requires a different test design, a different outcome metric, and a different test window. The sections below cover all three.
Why SR 11-7 Makes This Non-Optional for US Banks
SR 11-7, the Federal Reserve's supervisory guidance on model risk management (issued in parallel by the OCC as Bulletin 2011-12), defines a model as a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates. An AI collections model, whether it scores propensity to pay, segments accounts by risk tier, or determines the optimal contact strategy for each borrower, meets that definition.
SR 11-7 sets out three requirements relevant to champion-challenger testing:
- Independent model validation before deployment
- Ongoing performance monitoring after deployment
- Documented evidence that the model performs as intended across changing conditions
Champion-challenger testing addresses the second and third requirements simultaneously. A live challenger running against the production model generates ongoing performance comparison data. That comparison data is itself the documented evidence of ongoing evaluation that SR 11-7 requires.
What examiners look for specifically:
- Evidence that alternative approaches were considered and tested before or alongside the champion
- Documented comparison of champion and challenger outcomes on defined metrics
- A clear, pre-specified governance process for deciding when and how a challenger is promoted to champion
- Evidence that the team deciding on promotion is independent of the team that built the challenger
Banks that cannot produce this documentation during an OCC or Federal Reserve examination face model risk findings. Those findings trigger remediation requirements, examiner follow-up, and, depending on the materiality of the model, potential capital implications. For an AI collections system making recovery decisions across a material portion of a consumer loan portfolio, the materiality threshold is typically met.

What to Test: Three Applications in Collections
Propensity Model Testing
The champion is the current model producing a propensity-to-pay score that drives account prioritisation. The challenger is a retrained version using updated features, a different algorithm, or a broader feature set.
Assign accounts to champion or challenger scoring at the start of the test period and hold that assignment constant for the full test window, typically 60 to 90 days for early bucket collections. Do not reassign mid-test.
Measure on:
- Rank-order accuracy: does the model correctly rank high-propensity accounts above low-propensity accounts
- Gini coefficient and KS statistic: standard discrimination metrics for scoring models
- Recovery rate on accounts the model prioritised: the outcome metric that matters most operationally
- Contact efficiency: how many contacts per dollar recovered did champion-scored accounts require versus challenger-scored accounts
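The two discrimination metrics above can be computed directly from scores and observed payment outcomes. The following is a minimal sketch, assuming each arm supplies a list of model scores and a matching list of 0/1 payment flags; the function name `gini_and_ks` and the sample data are illustrative, not part of any specific platform. Gini is derived from the rank-based AUC (Gini = 2 x AUC - 1), and KS is the largest gap between the cumulative score distributions of payers and non-payers.

```python
from itertools import groupby


def gini_and_ks(scores, paid):
    """Return (gini, ks) for propensity scores against 0/1 payment outcomes."""
    n_pos = sum(1 for y in paid if y)
    n_neg = len(paid) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one payer and one non-payer")

    # Walk through accounts from lowest to highest score, one score value at a time.
    ranked = sorted(zip(scores, paid), key=lambda sp: sp[0])
    cum_pos = cum_neg = 0
    concordant = 0.0
    ks = 0.0
    for _, group in groupby(ranked, key=lambda sp: sp[0]):
        group = list(group)
        g_pos = sum(1 for _, y in group if y)
        g_neg = len(group) - g_pos
        # Each payer here outranks every lower-scored non-payer; ties count half.
        concordant += g_pos * (cum_neg + 0.5 * g_neg)
        cum_pos += g_pos
        cum_neg += g_neg
        # KS: gap between cumulative shares of non-payers and payers so far.
        ks = max(ks, abs(cum_neg / n_neg - cum_pos / n_pos))

    auc = concordant / (n_pos * n_neg)
    return 2 * auc - 1, ks


# Hypothetical challenger-arm data: scores and whether the account paid.
challenger_scores = [0.10, 0.25, 0.40, 0.55, 0.70, 0.90]
challenger_paid = [0, 0, 1, 0, 1, 1]
gini, ks = gini_and_ks(challenger_scores, challenger_paid)
```

Running the same function on the champion arm's scores and outcomes gives the side-by-side comparison. Because each arm is scored on its own randomly assigned population, differences in Gini or KS reflect the models rather than the accounts.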
Contact Strategy Testing
The champion is the current contact strategy: lead channel, contact time, message framing, frequency. The challenger modifies one of those variables.
The most common contact strategy challengers in US collections currently involve:
- Testing a digital-first sequence (email or SMS lead) against a voice-first sequence on early bucket accounts
- Testing contact time variations based on predicted response windows
- Testing message framing variants derived from behavioural segmentation
Measure on: right-party contact rate, promise-to-pay rate, payment completion within 7 and 30 days, opt-out rate, and complaint rate.
Design requirement: contact strategy challengers must be tested on accounts whose risk profiles are comparable to the champion's, not on an arbitrary or unstratified portfolio slice. If the champion and challenger populations have different underlying risk distributions, the results conflate contact strategy performance with account quality differences. Stratified assignment by risk tier prevents this.
Segmentation Testing
The champion is the current segmentation logic: how accounts are grouped and routed to different treatment strategies. The challenger is a revised segmentation model with new behavioural clusters, revised cut-points, or different variables driving segment assignment.
Measure on: treatment effectiveness per segment, segment stability over time, and recovery rate per segment compared to champion segment assignments on equivalent accounts.
Segmentation challengers require a longer test window, typically 90 to 120 days, because each segment must accumulate enough volume on its own to produce statistically significant results, which takes longer than reaching significance at the portfolio level.
How to Design a Statistically Valid Test
Four design requirements determine whether your champion-challenger results are actionable or noise.
Sample Size Calculated Before Launch
The test must be powered to detect the minimum improvement that would justify promoting the challenger. For a bank targeting a 2 percentage point improvement in 30-day recovery rate, the required sample size at 80% statistical power and a 5% significance level is calculated before accounts are assigned, not after results come in.
The most common failure in champion-challenger design is running a challenger on too small a sample, observing no statistically significant result, and concluding the challenger does not work. In many cases the test was simply underpowered to detect the effect size being targeted. The conclusion is not that the challenger failed. The conclusion is that the test was not designed to answer the question.
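As a rough illustration of the power calculation described above, the sketch below uses the standard normal-approximation formula for comparing two proportions with unequal arm sizes. The 20% baseline recovery rate is an assumption chosen for the example (the article does not state one), and the function and parameter names are hypothetical. `allocation_ratio` is the number of champion accounts per challenger account, so an 80/20 split corresponds to 4.0.

```python
import math
from statistics import NormalDist


def challenger_sample_size(p_champion, min_lift, alpha=0.05, power=0.80,
                           allocation_ratio=4.0):
    """Accounts needed per arm to detect `min_lift` over `p_champion`.

    Two-sided test on two proportions, normal approximation.
    Returns (n_champion, n_challenger).
    """
    p_challenger = p_champion + min_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # significance level
    z_beta = NormalDist().inv_cdf(power)            # statistical power

    # Variance of the difference in rates, expressed per challenger account.
    variance = (p_champion * (1 - p_champion) / allocation_ratio
                + p_challenger * (1 - p_challenger))

    n_challenger = math.ceil((z_alpha + z_beta) ** 2 * variance / min_lift ** 2)
    n_champion = math.ceil(allocation_ratio * n_challenger)
    return n_champion, n_challenger


# Assumed baseline: 20% 30-day recovery rate; target: +2 percentage points.
n_champion, n_challenger = challenger_sample_size(0.20, 0.02, allocation_ratio=4.0)
```

Under these assumptions the challenger arm needs on the order of four thousand accounts, with four times that in the champion arm. A challenger slice of a few hundred accounts would be far too small to detect a 2 percentage point lift, which is the underpowered-test failure described above.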
Stratified Random Assignment
Random assignment is the baseline. Stratified random assignment by risk tier is stronger. It ensures the champion and challenger populations are comparable on the dimensions that most strongly predict recovery outcomes, removing account quality differences as a confounding variable.
Assignment must be fixed for the duration of the test. Accounts assigned to the challenger stay in the challenger for the full window. Reassigning accounts mid-test because the challenger is performing differently than expected invalidates the results entirely.
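A minimal sketch of stratified random assignment follows, assuming each account arrives as an `(account_id, risk_tier)` pair; both field names, the tier labels, and the seed value are hypothetical. Accounts are shuffled within each tier with a seeded generator, so every tier contributes the same challenger share and the assignment can be reproduced for audit.

```python
import random
from collections import defaultdict


def stratified_assignment(accounts, challenger_share=0.20, seed=2024):
    """Assign accounts to 'champion' or 'challenger' within each risk tier.

    accounts: iterable of (account_id, risk_tier) pairs.
    Returns {account_id: arm}. The same inputs and seed give the same result.
    """
    by_tier = defaultdict(list)
    for account_id, tier in accounts:
        by_tier[tier].append(account_id)

    rng = random.Random(seed)
    assignment = {}
    for tier in sorted(by_tier):
        ids = sorted(by_tier[tier])      # make the result independent of input order
        rng.shuffle(ids)
        n_challenger = round(len(ids) * challenger_share)
        for position, account_id in enumerate(ids):
            arm = "challenger" if position < n_challenger else "champion"
            assignment[account_id] = arm
    return assignment


# Hypothetical portfolio: 100 accounts in each of three risk tiers.
portfolio = [(f"{tier}-{i:03d}", tier)
             for tier in ("high", "low", "medium") for i in range(100)]
arms = stratified_assignment(portfolio)
```

In practice the resulting mapping would be written to a table once at launch and read from there for the whole window, rather than recomputed, so that no account can change arms mid-test.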
Test Window Matched to Outcome Metric
The window must be long enough to observe the outcome being measured. If the test measures 30-day payment rate, the window is at minimum 30 days post-contact plus the ramp-up period during which contacts are being generated. If the test measures 90-day recovery rate, the window is correspondingly longer.
Running a two-week test and drawing conclusions on a 30-day outcome metric is a design error. A two-week read misses payments that would have arrived later in the 30-day window, so it understates recovery in both arms and can misstate the gap between them.
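The window arithmetic above can be made explicit. This sketch assumes a known launch date, a ramp-up period in days, and a record of each account's first contact date; all names and dates are illustrative. It computes the earliest date a full read is possible and, for an interim read, keeps only accounts whose outcome window has fully elapsed.

```python
from datetime import date, timedelta


def earliest_readout(launch, ramp_up_days, outcome_window_days):
    """First date on which every contacted account has a complete outcome window."""
    return launch + timedelta(days=ramp_up_days + outcome_window_days)


def matured_accounts(first_contact, outcome_window_days, as_of):
    """Accounts whose outcome window has fully elapsed by `as_of`.

    first_contact: {account_id: date of first contact}.
    """
    window = timedelta(days=outcome_window_days)
    return sorted(acct for acct, contacted in first_contact.items()
                  if contacted + window <= as_of)


# Hypothetical test: launched 6 January 2025, 14-day ramp-up, 30-day payment metric.
readout = earliest_readout(date(2025, 1, 6), ramp_up_days=14, outcome_window_days=30)

contacts = {"A": date(2025, 1, 6), "B": date(2025, 1, 20)}
ready = matured_accounts(contacts, outcome_window_days=30, as_of=date(2025, 2, 5))
```

Restricting any interim analysis to matured accounts avoids counting accounts that have not yet had their full 30 days to pay.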
One Variable Changes Between Champion and Challenger
If the challenger changes the model, the contact channel, and the message framing simultaneously, a better result tells you something worked. It does not tell you which change drove the improvement, which makes the result impossible to act on systematically.
One variable changes. One variable is evaluated. The result is attributable and replicable.
The Governance Process and Continuous Testing
The promotion decision, moving a challenger from test to production, requires a documented governance process under SR 11-7. That process has four components.
Performance threshold criteria defined before the test begins. Document the specific metric, the specific threshold, and the required confidence level before any accounts are assigned. Post-hoc threshold setting, deciding after results are in what would constitute success, invalidates the governance process and will be noted by examiners.
Independent review of results. The function reviewing results and making the promotion recommendation must be independent of the team that built and ran the challenger. Model risk management or an independent validation function fills this role. The team that built the challenger has an interest in the result. The review function should not.
Complete documentation package. The package must include: test design and sample size rationale, assignment methodology and stratification approach, full results across all measured metrics for both champion and challenger, statistical significance calculations, recommendation with supporting rationale, and approval sign-off with date and role of the approving authority.
Continuous challenger programme after promotion. A single champion-challenger test at deployment does not satisfy SR 11-7’s ongoing monitoring requirement. A continuous challenger programme maintains a lightweight challenger running at all times on a small portfolio slice, typically 10 to 15% of accounts. This challenger is refreshed quarterly with the most recent model retrain candidate.
The continuous challenger serves two purposes. First, when the champion model begins to drift, the live challenger generates an immediate comparison point that quantifies the degradation and provides an alternative ready for evaluation. Without a live challenger, drift is detected only by monitoring the champion’s own metrics in isolation, which takes longer to surface and provides no immediate alternative. Second, the continuous challenger test record is itself the documented evidence of ongoing model evaluation that SR 11-7 requires. The testing programme and the monitoring programme are the same thing.
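The pre-specified threshold criteria in the first governance component can be captured as a frozen record written before launch and then applied mechanically to the results. The sketch below is one possible decision rule, not a prescribed one: it uses a pooled two-proportion z-test and requires both a significant result and an observed lift at or above the documented minimum. Whether the threshold applies to the observed lift, a lower confidence bound, or significance alone is a governance choice. The class, field, and function names are hypothetical, as are the counts in the example.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist


@dataclass(frozen=True)
class PromotionCriteria:
    """Recorded before any accounts are assigned; frozen so it cannot be edited later."""
    metric: str
    min_lift: float   # minimum observed improvement, as a proportion
    alpha: float      # two-sided significance level


def evaluate_promotion(criteria, champ_paid, champ_n, chal_paid, chal_n):
    """Compare challenger and champion rates against the pre-registered criteria."""
    p_champ = champ_paid / champ_n
    p_chal = chal_paid / chal_n
    lift = p_chal - p_champ

    # Pooled two-proportion z-test.
    pooled = (champ_paid + chal_paid) / (champ_n + chal_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / champ_n + 1 / chal_n))
    if se == 0:
        raise ValueError("no variation in outcomes; test cannot be evaluated")
    z = lift / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    meets = lift >= criteria.min_lift and p_value < criteria.alpha
    return {"metric": criteria.metric, "lift": lift,
            "p_value": p_value, "meets_criteria": meets}


criteria = PromotionCriteria(metric="30-day recovery rate", min_lift=0.02, alpha=0.05)
# Hypothetical results: 3,322 of 16,612 champion accounts paid; 955 of 4,153 challenger.
result = evaluate_promotion(criteria, 3322, 16612, 955, 4153)
```

The output of a function like this is an input to the independent review, not a substitute for it. The review function would still examine the secondary metrics, such as complaint and opt-out rates, before recommending promotion.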

Pre-Deployment Checklist
Before launching a champion-challenger programme on an AI collections model:
- Test design documented before launch: metric, threshold, confidence level, and sample size calculation recorded
- Stratified random assignment by risk tier confirmed and assignment method documented
- Test window set to match the outcome metric being measured
- Single variable change confirmed between champion and challenger
- Independent review function identified, briefed, and confirmed as separate from the build team
- Performance threshold criteria recorded before accounts are assigned
- Documentation package template prepared and signed off by model risk management
- Production deployment and version control process confirmed for challenger promotion
- Continuous challenger cadence defined: portfolio slice size, refresh frequency, and monitoring review schedule
The Model That Has Nothing to Compare Against Is the Riskiest One
A collections AI system running without a challenger has no external reference point for its own performance. Recovery rates can drift, segmentation can decay, and contact strategies can become misaligned with current borrower behaviour, and none of it is visible until the degradation is large enough to show up in portfolio-level metrics.
Champion-challenger testing provides that reference point continuously. It also provides the documentation that SR 11-7 requires, the governance process that OCC examiners expect, and the mechanism for systematic improvement of collections performance over time.
The banks that improve recovery rates consistently are the ones running challengers continuously, evaluating results rigorously, and promoting improvements through a documented process. The banks that stay static are the ones comparing their current performance only to their own past performance, with no external benchmark in sight.
iTuring’s AI collections platform includes native champion-challenger testing infrastructure: stratified account assignment, parallel model scoring, side-by-side outcome tracking, and governance workflow for challenger promotion with full audit documentation.


