May 21, 2026

Champion-Challenger Testing in AI Collections: A Practical Guide for US Banks

13 min read

Collections & Recovery, Model Risk & Validation

TL;DR

Champion-challenger testing validates AI models before full deployment
Run live on a subset of accounts, not in simulation
SR 11-7 model risk guidance expects documented challenger testing
Results measure contact strategy, segmentation, and recovery outcomes
Continuous challenger testing prevents model drift going undetected

A collections manager at a mid-size regional bank has been running the same contact strategy for three years. Recovery rates on early bucket accounts have drifted down four percentage points over that period. The strategy still looks reasonable in the monthly report. The segmentation logic has not changed. The contact sequences are the same ones that worked in 2022. Nobody has flagged a problem because nobody has anything to compare the current performance against. The benchmark is the strategy itself.

That is the problem champion-challenger testing solves. The champion is the current production model or contact strategy. The challenger is an alternative you want to evaluate. You run both simultaneously on live accounts, assign borrowers to each at random, measure the outcomes side by side over a defined window, and let the results determine which one runs the portfolio going forward.

For US banks operating under SR 11-7 model risk management guidance, this is not an optional practice. OCC and Federal Reserve examiners expect documented evidence of challenger testing as part of model validation. A collections AI system making recovery decisions at scale without a documented champion-challenger programme is a model risk finding waiting to happen.

What Champion-Challenger Testing Is

The champion is whatever is currently running in production: the propensity model scoring accounts, the segmentation logic routing them to treatment strategies, or the contact strategy determining channel, timing, and message. The challenger is an alternative version of any one of those components that you want to evaluate against the current approach.

Both run simultaneously on live account subsets. Accounts are assigned to champion or challenger at random, or through stratified random assignment by risk tier. The split is typically 80/20 or 70/30 in favour of the champion, which preserves portfolio stability while generating enough challenger data to produce statistically meaningful results.

The critical distinction from backtesting is that champion-challenger runs on real borrowers making real decisions in real market conditions. Backtesting applies a model to historical data. It tells you how a model would have performed in the past. Champion-challenger testing tells you how borrowers respond to your approach right now, under current economic conditions, with their current financial circumstances. Those are different questions, and only one of them is relevant to what your collections operation does next.

In AI collections, champion-challenger testing can be applied to three distinct components:

The predictive model: the algorithm scoring each account’s propensity to pay or likelihood of response
The contact strategy: the channel sequence, contact timing, frequency, and message framing
The segmentation logic: how accounts are grouped and routed to different treatment strategies

Each requires a different test design, a different outcome metric, and a different test window. The sections below cover all three.

Why SR 11-7 Makes This Non-Optional for US Banks

SR 11-7, the Federal Reserve and OCC joint guidance on model risk management, defines a model as a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates. An AI collections model, whether it scores propensity to pay, segments accounts by risk tier, or determines the optimal contact strategy for each borrower, meets that definition.

SR 11-7 sets out three expectations that bear on champion-challenger testing:

Independent model validation before deployment
Ongoing performance monitoring after deployment
Documented evidence that the model performs as intended across changing conditions

Champion-challenger testing addresses the second and third requirements simultaneously. A live challenger running against the production model generates ongoing performance comparison data. That comparison data is the documented evidence of ongoing evaluation that SR 11-7 requires.

What examiners look for specifically:

Evidence that alternative approaches were considered and tested before or alongside the champion
Documented comparison of champion and challenger outcomes on defined metrics
A clear, pre-specified governance process for deciding when and how a challenger is promoted to champion
Evidence that the team deciding on promotion is independent of the team that built the challenger

Banks that cannot produce this documentation during an OCC or Federal Reserve examination face model risk findings. Those findings trigger remediation requirements, examiner follow-up, and, depending on the materiality of the model, potential capital implications. For an AI collections system making recovery decisions across a material portion of a consumer loan portfolio, the materiality threshold is typically met.

What to Test: Three Applications in Collections

Propensity Model Testing

The champion is the current model producing a propensity-to-pay score that drives account prioritisation. The challenger is a retrained version using updated features, a different algorithm, or a broader feature set.

Assign accounts to champion or challenger scoring at the start of the test period and hold that assignment constant for the full test window, typically 60 to 90 days for early bucket collections. Do not reassign mid-test.

Measure on:

Rank-order accuracy: does the model correctly rank high-propensity accounts above low-propensity accounts
Gini coefficient and KS statistic: standard discrimination metrics for scoring models
Recovery rate on accounts the model prioritised: the outcome metric that matters most operationally
Contact efficiency: how many contacts accounts scored by the champion model required per dollar recovered, compared with accounts scored by the challenger
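The two discrimination metrics above can be computed directly from scores and observed outcomes. The following is a minimal sketch, not a production implementation. It assumes a higher score means a higher propensity to pay and that each account has a binary paid/not-paid outcome at the end of the window. The function name `ks_and_gini` is hypothetical.

```python
from typing import Sequence


def ks_and_gini(scores: Sequence[float], paid: Sequence[int]) -> tuple[float, float]:
    """Return (KS statistic, Gini coefficient) for a propensity-to-pay score.

    scores: model scores; higher is assumed to mean more likely to pay.
    paid:   1 if the account paid within the outcome window, else 0.
    """
    n_pos = sum(paid)
    n_neg = len(paid) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one payer and one non-payer")

    pairs = sorted(zip(scores, paid))  # ascending by score
    cum_pos = cum_neg = 0
    ks = 0.0
    concordant = 0.0  # payer/non-payer pairs ranked correctly; ties count half

    i = 0
    while i < len(pairs):
        # Collect all accounts sharing the same score so ties are handled together.
        j = i
        tie_pos = tie_neg = 0
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tie_pos += pairs[j][1]
            tie_neg += 1 - pairs[j][1]
            j += 1
        # Payers at this score outrank every non-payer with a lower score.
        concordant += tie_pos * (cum_neg + 0.5 * tie_neg)
        cum_pos += tie_pos
        cum_neg += tie_neg
        # KS is the largest gap between the two cumulative distributions.
        ks = max(ks, abs(cum_pos / n_pos - cum_neg / n_neg))
        i = j

    auc = concordant / (n_pos * n_neg)
    return ks, 2 * auc - 1  # Gini = 2 * AUC - 1
```

Run the function once on the champion-scored accounts and once on the challenger-scored accounts, using outcomes from the same test window, and record both pairs of values in the documentation package.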

Contact Strategy Testing

The champion is the current contact strategy: lead channel, contact time, message framing, frequency. The challenger modifies one of those variables.

The most common contact strategy challengers in US collections currently involve:

Testing a digital-first sequence (email or SMS lead) against a voice-first sequence on early bucket accounts
Testing contact time variations based on predicted response windows
Testing message framing variants derived from behavioural segmentation

Measure on: right-party contact rate, promise-to-pay rate, payment completion within 7 and 30 days, opt-out rate, and complaint rate.

Design requirement: contact strategy challengers must be tested on champion and challenger populations with comparable risk profiles, not on an unstratified random slice of the portfolio. Simple random assignment can, by chance, leave the two arms with different underlying risk distributions, especially when the challenger arm is small, and the results then conflate contact strategy performance with account quality differences. Stratified random assignment by risk tier prevents this.

Segmentation Testing

The champion is the current segmentation logic: how accounts are grouped and routed to different treatment strategies. The challenger is a revised segmentation model with new behavioural clusters, revised cut-points, or different variables driving segment assignment.

Measure on: treatment effectiveness per segment, segment stability over time, and recovery rate per segment compared to champion segment assignments on equivalent accounts.

Segmentation challengers require a longer test window, typically 90 to 120 days, because each segment must accumulate enough volume on its own to reach statistical significance, which takes longer than reaching significance at the portfolio level.

How to Design a Statistically Valid Test

Four design requirements determine whether your champion-challenger results are actionable or noise.

Sample Size Calculated Before Launch

The test must be powered to detect the minimum improvement that would justify promoting the challenger. For a bank targeting a 2 percentage point improvement in 30-day recovery rate, the required sample size at 80% statistical power and a 5% significance level is calculated before accounts are assigned, not after results come in.

The most common failure in champion-challenger design is running a challenger on too small a sample, observing no statistically significant result, and concluding the challenger does not work. In many cases the test was simply underpowered to detect the effect size being targeted. The conclusion is not that the challenger failed. The conclusion is that the test was not designed to answer the question.
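As an illustration of the calculation, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions. The 30% baseline recovery rate is an assumed figure for the example, not a number from this guide. The function name and parameters are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def challenger_sample_size(
    baseline_rate: float,
    min_uplift: float,
    champion_to_challenger_ratio: float = 1.0,
    alpha: float = 0.05,
    power: float = 0.80,
) -> tuple[int, int]:
    """Accounts needed in the (challenger, champion) arms.

    Two-sided two-proportion z-test, unpooled normal approximation.
    min_uplift is absolute: 0.02 means 2 percentage points.
    """
    p_champion = baseline_rate
    p_challenger = baseline_rate + min_uplift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    k = champion_to_challenger_ratio  # 4.0 corresponds to an 80/20 split

    variance = p_champion * (1 - p_champion) / k + p_challenger * (1 - p_challenger)
    n_challenger = ceil((z_alpha + z_beta) ** 2 * variance / min_uplift ** 2)
    return n_challenger, ceil(k * n_challenger)


# Assumed 30% baseline, 2 percentage point target uplift.
equal_split = challenger_sample_size(0.30, 0.02)
split_80_20 = challenger_sample_size(0.30, 0.02, champion_to_challenger_ratio=4.0)
print(equal_split, split_80_20)
```

Under these assumptions an equal split needs roughly 8,400 accounts per arm, while an 80/20 split needs roughly 5,300 challenger accounts alongside four times as many champion accounts. The uneven split protects portfolio stability but requires more accounts in total to reach the same power.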

Stratified Random Assignment

Random assignment is the baseline. Stratified random assignment by risk tier is stronger. It ensures the champion and challenger populations are comparable on the dimensions that most strongly predict recovery outcomes, removing account quality differences as a confounding variable.

Assignment must be fixed for the duration of the test. Accounts assigned to the challenger stay in the challenger for the full window. Reassigning accounts mid-test because the challenger is performing differently than expected invalidates the results entirely.
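A minimal sketch of stratified random assignment with a fixed seed is shown below. It assumes each account arrives as an `(account_id, risk_tier)` pair and that the resulting mapping is stored once at launch and never recomputed. The function name, the 20% share, and the seed value are illustrative assumptions.

```python
import random
from collections import defaultdict
from typing import Iterable


def stratified_assignment(
    accounts: Iterable[tuple[str, str]],
    challenger_share: float = 0.20,
    seed: int = 20260521,
) -> dict[str, str]:
    """Assign accounts to 'champion' or 'challenger' within each risk tier.

    accounts: (account_id, risk_tier) pairs.
    Returns {account_id: arm}. The same input set and seed always produce
    the same assignment, so the split can be reproduced for audit.
    """
    by_tier: dict[str, list[str]] = defaultdict(list)
    for account_id, tier in accounts:
        by_tier[tier].append(account_id)

    rng = random.Random(seed)  # local generator; does not touch global state
    assignment: dict[str, str] = {}
    for tier in sorted(by_tier):
        ids = sorted(by_tier[tier])  # canonical order before shuffling
        rng.shuffle(ids)
        n_challenger = round(len(ids) * challenger_share)
        for position, account_id in enumerate(ids):
            arm = "challenger" if position < n_challenger else "champion"
            assignment[account_id] = arm
    return assignment
```

Because the shuffle happens inside each tier, every tier contributes the same share of accounts to the challenger, which keeps the two arms comparable on risk. Persisting the returned mapping, rather than recomputing it, is what keeps the assignment fixed for the full window.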

Test Window Matched to Outcome Metric

The window must be long enough to observe the outcome being measured. If the test measures 30-day payment rate, the window is at minimum 30 days post-contact plus the ramp-up period during which contacts are being generated. If the test measures 90-day recovery rate, the window is correspondingly longer.

Running a two-week test and drawing conclusions on a 30-day outcome metric is a design error. The results will undercount payments that would have completed within the measurement window under normal conditions.
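The window arithmetic can be made explicit before launch. This is a minimal sketch that assumes the ramp-up period is the number of days until the last assigned account receives its first contact. The function name and the example dates are hypothetical.

```python
from datetime import date, timedelta


def earliest_readout(launch: date, ramp_up_days: int, outcome_days: int) -> date:
    """First date on which every assigned account has a full outcome window.

    launch:       date the first contacts go out.
    ramp_up_days: days until the last account receives its first contact.
    outcome_days: length of the outcome metric, e.g. 30 for 30-day payment rate.
    """
    last_first_contact = launch + timedelta(days=ramp_up_days)
    return last_first_contact + timedelta(days=outcome_days)


# A test launched 1 June 2026 with a 14-day ramp-up and a 30-day outcome
# metric cannot be read out before 15 July 2026.
print(earliest_readout(date(2026, 6, 1), ramp_up_days=14, outcome_days=30))
```

Recording this date in the test design makes an early readout an explicit deviation from the plan rather than a judgment call.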

One Variable Changes Between Champion and Challenger

If the challenger changes the model, the contact channel, and the message framing simultaneously, a better result tells you something worked. It does not tell you which change drove the improvement, which makes the result impossible to act on systematically.

One variable changes. One variable is evaluated. The result is attributable and replicable.

The Governance Process and Continuous Testing

The promotion decision, moving a challenger from test to production, requires a documented governance process under SR 11-7. That process has four components.

Performance threshold criteria defined before the test begins. Document the specific metric, the specific threshold, and the required confidence level before any accounts are assigned. Post-hoc threshold setting, deciding after results are in what would constitute success, invalidates the governance process and will be noted by examiners.
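One way to make the pre-specified criteria hard to revise after the fact is to record them as an immutable object and evaluate results only against that object. The sketch below assumes a binary outcome, such as payment within 30 days, and a two-sided two-proportion z-test. The class, the function, and the rule that the observed uplift must both reach the threshold and be statistically significant are illustrative assumptions, not an SR 11-7 prescription.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass(frozen=True)  # frozen: criteria cannot be modified once recorded
class PromotionCriteria:
    metric: str
    min_uplift: float  # absolute, e.g. 0.02 = 2 percentage points
    alpha: float       # significance level, e.g. 0.05


def evaluate(
    criteria: PromotionCriteria,
    champion_successes: int,
    champion_n: int,
    challenger_successes: int,
    challenger_n: int,
) -> dict:
    """Compare arms with a pooled two-proportion z-test against the criteria."""
    p_champion = champion_successes / champion_n
    p_challenger = challenger_successes / challenger_n
    pooled = (champion_successes + challenger_successes) / (champion_n + challenger_n)
    std_err = sqrt(pooled * (1 - pooled) * (1 / champion_n + 1 / challenger_n))
    z = (p_challenger - p_champion) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    uplift = p_challenger - p_champion
    return {
        "metric": criteria.metric,
        "uplift": uplift,
        "z": z,
        "p_value": p_value,
        "meets_criteria": uplift >= criteria.min_uplift and p_value < criteria.alpha,
    }


criteria = PromotionCriteria(metric="30-day payment rate", min_uplift=0.02, alpha=0.05)
result = evaluate(criteria, 6000, 20000, 1650, 5000)
print(result["meets_criteria"])
```

The output of `evaluate` is an input to the independent review, not the promotion decision itself. The reviewing function still weighs the secondary metrics, such as opt-out and complaint rates, before making a recommendation.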

Independent review of results. The function reviewing results and making the promotion recommendation must be independent of the team that built and ran the challenger. Model risk management or an independent validation function fills this role. The team that built the challenger has an interest in the result. The review function should not.

Complete documentation package. The package must include: test design and sample size rationale, assignment methodology and stratification approach, full results across all measured metrics for both champion and challenger, statistical significance calculations, recommendation with supporting rationale, and approval sign-off with date and role of the approving authority.

Continuous challenger programme after promotion. A single champion-challenger test at deployment does not satisfy SR 11-7’s ongoing monitoring requirement. A continuous challenger programme maintains a lightweight challenger running at all times on a small portfolio slice, typically 10 to 15% of accounts. This challenger is refreshed quarterly with the most recent model retrain candidate.

The continuous challenger serves two purposes. First, when the champion model begins to drift, the live challenger generates an immediate comparison point that quantifies the degradation and provides an alternative ready for evaluation. Without a live challenger, drift is detected only by monitoring the champion’s own metrics in isolation, which takes longer to surface and provides no immediate alternative. Second, the continuous challenger test record is itself the documented evidence of ongoing model evaluation that SR 11-7 requires. The testing programme and the monitoring programme are the same thing.

Pre-Deployment Checklist

Before launching a champion-challenger programme on an AI collections model:

Test design documented before launch: metric, threshold, confidence level, and sample size calculation recorded
Stratified random assignment by risk tier confirmed and assignment method documented
Test window set to match the outcome metric being measured
Single variable change confirmed between champion and challenger
Independent review function identified, briefed, and confirmed as separate from the build team
Performance threshold criteria recorded before accounts are assigned
Documentation package template prepared and signed off by model risk management
Production deployment and version control process confirmed for challenger promotion
Continuous challenger cadence defined: portfolio slice size, refresh frequency, and monitoring review schedule

The Model That Has Nothing to Compare Against Is the Riskiest One

A collections AI system running without a challenger has no external reference point for its own performance. Recovery rates can drift, segmentation can decay, and contact strategies can become misaligned with current borrower behaviour, and none of it is visible until the degradation is large enough to show up in portfolio-level metrics.

Champion-challenger testing provides that reference point continuously. It also provides the documentation that SR 11-7 requires, the governance process that OCC examiners expect, and the mechanism for systematic improvement of collections performance over time.

The banks that improve recovery rates consistently are the ones running challengers continuously, evaluating results rigorously, and promoting improvements through a documented process. The banks that stay static are the ones comparing their current performance only to their own past performance, with no external benchmark in sight.

iTuring’s AI collections platform includes native champion-challenger testing infrastructure: stratified account assignment, parallel model scoring, side-by-side outcome tracking, and governance workflow for challenger promotion with full audit documentation.

Frequently Asked Questions

What is champion-challenger testing in AI collections?

Champion-challenger testing runs two versions of a collections model or contact strategy simultaneously on live account subsets. The champion is the current production approach. The challenger is an alternative being evaluated. Both run at the same time on accounts assigned at random, or by stratified random assignment, and outcomes are measured side by side to determine which performs better under real market conditions.

How is champion-challenger testing different from backtesting?

Backtesting applies a model to historical data to see how it would have performed in the past. Champion-challenger testing runs on live accounts in real market conditions and captures actual borrower behaviour in response to real contact attempts. Backtesting cannot replicate how borrowers respond to specific messages, channels, or timing under current conditions. Champion-challenger testing captures exactly that.

What does SR 11-7 require for AI collections model validation?

SR 11-7 requires independent model validation, ongoing performance monitoring, and documented evidence that models perform as intended across changing conditions. For AI collections models this means documented challenger testing, independent review of results, a governance process for model promotion decisions, and continuous monitoring after deployment.

How do you determine the right sample size for a champion-challenger test?

Sample size is calculated before the test begins based on the minimum improvement needed to justify promotion, the desired statistical power (typically 80%), and the significance level (typically 5%). Running the test first and calculating power afterwards produces unreliable conclusions.

How long should a champion-challenger test run in collections?

The window must be long enough to observe the outcome being measured. For a test measuring 30-day payment rates, the minimum is 30 days post-contact plus the ramp-up period. For write-off prevention or longer-cycle outcomes, the window is substantially longer. Two-week tests on 90-day outcome metrics produce unreliable results.

What does the challenger promotion process require under SR 11-7?

The promotion process requires: performance threshold criteria defined before the test begins, independent review by a function separate from the team that built the challenger, a full documentation package covering design, methodology, results, statistical significance, and recommendation, and a formal approval sign-off with date and role recorded.

How does continuous challenger testing work as a model monitoring practice?

A lightweight challenger runs at all times on 10 to 15% of accounts, refreshed quarterly with the most recent retrain candidate. When the champion model drifts, the challenger provides an immediate comparison point and a ready alternative. The challenger test record also serves as documented evidence of ongoing model evaluation under SR 11-7.

What happens if a US bank cannot produce champion-challenger documentation during an OCC examination?

Inability to produce documentation is a model risk management finding under SR 11-7. Findings trigger remediation requirements and examiner follow-up. For AI collections systems making recovery decisions at scale, the materiality threshold is typically met, making the documentation requirement consequential rather than administrative.

About the Author

Amit Kumar

Co-Founder & VP Product Engineering

Amit Kumar is Co-Founder and Vice President of Product Engineering at iTuring.ai.

He writes about building enterprise-grade AI infrastructure, designing platforms for reliability and scale, integrating AI with legacy banking systems, and the architectural decisions that separate proof-of-concepts from production-ready solutions.

Amit believes great engineering is invisible because it works, every time.
