TL;DR

  • Champion-challenger testing improves AI collections without replacing production
  • SA portfolio characteristics make ongoing model testing particularly valuable
  • Test design must be documented before the first account is assigned
  • SARB model governance requirements apply when a challenger becomes champion
  • Any element of collections decision logic can be structured as a challenger

A collections model has been running at a South African bank for ten months. It was built on 24 months of payment data, the bulk of it pre-pandemic. It performed well in the first two quarters after deployment. Recovery rates in the 30-60 days past due (DPD) bucket improved. Cost per recovery came down.

In the last two quarters, the picture has changed. Payment rates in the mid-propensity band are lower than the model’s scores would predict. The data science team has built a new model trained on more recent data, incorporating post-pandemic payment behaviour patterns and updated signals for multiple credit agreement exposure. In backtesting on a holdout sample, the new model outperforms the current one across every DPD bucket.

The question the collections head and the chief risk officer are now sitting with is what to do with it. Replacing the production model based on backtest performance alone carries risk. A backtest runs on data where the outcomes are already known. Live portfolios behave differently. The South African credit market has specific characteristics that even a carefully constructed historical training window may not fully capture.

Champion-challenger testing is how the bank answers that question without taking on the full replacement risk. The current model stays in production as the champion. The new model runs on a controlled slice of the portfolio as the challenger. Both are evaluated against live outcomes over a defined window. The switch happens only when the data supports it and governance has signed off.

This post covers what champion-challenger testing is, why SA portfolio characteristics make ongoing testing particularly valuable, how to structure a test that produces valid results, what can be tested beyond full model replacement, and what the South African Reserve Bank (SARB) requires when a challenger becomes the new champion.

What Champion-Challenger Testing Is

Champion-challenger testing is a controlled experiment in which the production model and a candidate replacement or variant run simultaneously on separate, randomly assigned account populations.

The champion continues to drive treatment decisions for the majority of the portfolio. The challenger runs on a defined test population, typically 10 to 20% of the relevant bucket, receiving its own treatment assignments based on its own scores or logic. Both populations are tracked on the same outcome metrics over the same test window. At the end of the window, the results are compared. The switch happens only when the challenger’s outperformance is statistically validated and governance-approved.

The structural difference from backtesting matters for anyone making a model replacement decision. A backtest evaluates a model against historical data where outcomes are already known. The model can be tuned to perform well on that specific data window. A champion-challenger test evaluates both models against live outcomes in the current SA portfolio environment, with current borrower behaviour, current economic conditions, and current credit market dynamics. Outperformance in backtesting that does not hold in live testing is a well-documented phenomenon. The champion-challenger structure catches this before the switch is made, not after.

The challenger does not have to be a full propensity model replacement. Any element of the collections decision logic can be structured as a challenger test:

  • A new propensity scoring model trained on a more recent or differently constructed dataset
  • A revised score band configuration, such as five bands instead of four, to test whether finer segmentation produces better treatment differentiation
  • A new channel sequencing strategy for a specific borrower segment or DPD bucket
  • A revised treatment intensity, such as three contact attempts in five days versus two in seven
  • A new self-cure suppression threshold to test whether suppressing more accounts reduces cost without affecting payment rate
  • A new message variant for a specific treatment tier to test whether a different tone or call to action produces better payment completion

Full propensity model replacement carries the highest governance overhead. Message variant and channel sequencing tests carry substantially lower overhead and often produce meaningful improvements faster. Both types belong in an ongoing challenger testing programme.

Why SA Portfolio Characteristics Make Ongoing Testing Valuable

Four characteristics of South African credit portfolios make champion-challenger testing more valuable than relying on annual model reviews or periodic manual recalibration.

Post-Pandemic Payment Behaviour Shift

South African consumer credit behaviour shifted materially during and after the COVID-19 period. Payment moratoriums, large-scale debt restructuring activity, and changed employment patterns created a cohort of borrowers whose payment behaviour at any given DPD stage differs from the pre-pandemic historical baseline. A collections model built predominantly on pre-2020 payment data may systematically misread the propensity of this cohort because the relationships between their observable characteristics and their payment behaviour were shaped by conditions that do not exist in the current portfolio.

A challenger model trained on a post-2021 rolling window can be tested against the champion specifically on the cohort whose credit behaviour was most affected by the moratorium period. If the challenger outperforms on this specific population, the evidence supports a targeted update or a full model replacement, depending on the size of the affected cohort.

Multiple Credit Agreement Exposure

South African borrowers frequently carry credit across banks, clothing retailers, furniture retailers, and micro-lenders simultaneously. The payment behaviour signals that predict self-cure for a borrower with a single active credit agreement do not carry the same meaning for a borrower managing five agreements and allocating cash flow across all of them. Cross-agreement deterioration, where two or more agreements show simultaneous DPD deterioration, is a materially stronger predictor of an account becoming a non-performing asset (NPA) than single-account behaviour.

A challenger model that incorporates cross-agreement signals more effectively than the current champion can be tested against it on the multi-agreement borrower segment specifically, using a population split that oversamples this segment to ensure the test produces statistically valid results within a standard window.

Debt Review Cohort Dynamics

The population of borrowers entering, progressing through, and exiting debt review changes the composition of the collections portfolio over time. A champion model calibrated before a significant wave of debt review entries may score the current portfolio less accurately because the population distribution has shifted. Borrowers who have exited debt review and re-entered the collections pipeline carry different payment behaviour profiles from the general delinquency population. A challenger model that incorporates debt review history as a distinct signal can be tested on this re-entry cohort to determine whether it produces better outcomes.

Macro Sensitivity to Rand Volatility and Administered Prices

South African consumer credit performance responds to rand exchange rate movements, fuel price changes, and administered price increases in ways that affect disposable income and payment capacity faster than annual model reviews can capture. A model trained during a period of relative macroeconomic stability may underperform when consumer price pressure intensifies. A challenger incorporating more recent macro-sensitive signals can be tested against the champion during the specific period of pressure to generate live evidence of whether the adjustment produces better outcomes.

How to Structure a Valid Test for SA Collections

Five elements determine whether a champion-challenger test produces results that are actionable, statistically valid, and defensible under SARB’s governance framework.

[Figure: Champion-challenger testing framework for South African banks and credit providers, in five steps: Population Assignment, Test Window, Outcome Metrics, Statistical Validity Threshold, and Champion Switch Governance.]

Population Assignment

Accounts must be randomly assigned to champion and challenger populations at the account level, before any treatment is applied in the test window. No predictor variable used in the models should be used to split the populations. Assignment based on balance band, DPD range, bureau score, or any other characteristic that the models also use as a predictor will introduce selection bias that makes the results uninterpretable. The populations must be statistically equivalent on all observable characteristics at the point of assignment.
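
One way to make the assignment both random and auditable is a seeded shuffle of account IDs, with the seed recorded in the test design before any account is assigned. The sketch below is illustrative only: the function names `assign_arms` and `standardized_mean_difference`, the 15% challenger share, and the default seed are assumptions, not part of any specific platform.

```python
import math
import random
from statistics import mean, pvariance


def assign_arms(account_ids, challenger_share=0.15, seed=20240101):
    """Randomly split accounts into champion and challenger arms.

    Only the account ID and a fixed seed are used, so no model predictor
    (balance band, DPD range, bureau score) can influence the split.
    """
    if not 0.0 < challenger_share < 1.0:
        raise ValueError("challenger_share must be strictly between 0 and 1")
    # Sort first so the same seed always reproduces the same split,
    # whatever order the IDs arrive in.
    shuffled = sorted(account_ids)
    random.Random(seed).shuffle(shuffled)
    n_challenger = round(len(shuffled) * challenger_share)
    challenger = set(shuffled[:n_challenger])
    return {
        account: "challenger" if account in challenger else "champion"
        for account in shuffled
    }


def standardized_mean_difference(champion_values, challenger_values):
    """Balance check for one observable characteristic, e.g. balance.

    Values close to zero suggest the two arms look alike on that field.
    """
    pooled_sd = math.sqrt(
        (pvariance(champion_values) + pvariance(challenger_values)) / 2
    )
    if pooled_sd == 0:
        return 0.0
    return (mean(challenger_values) - mean(champion_values)) / pooled_sd
```

Running the balance check on each observable field (balance, DPD, bureau score) immediately after assignment gives the governance file evidence that the two populations were equivalent at the point of assignment.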

Population size must be sufficient to detect a meaningful difference at the chosen confidence level. For a South African retail bank or credit provider with a mid-size 30-60 DPD book, several hundred accounts per arm is a practical minimum for a payment rate comparison at 95% confidence. The actual requirement depends on the baseline payment rate, the outcome variance in the portfolio, and the size of the improvement the test is designed to detect: smaller expected improvements need substantially larger arms. Tests run on populations that are too small to detect the expected improvement will produce directional but inconclusive results, which are not grounds for a champion switch under SARB governance.
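
How many accounts each arm needs depends heavily on the improvement the test is meant to detect. The sketch below uses the standard normal-approximation formula for comparing two proportions. The function name `accounts_per_arm`, the 80% power default, and any payment rates passed to it are illustrative assumptions rather than figures from a specific portfolio.

```python
import math
from statistics import NormalDist


def accounts_per_arm(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate accounts needed in EACH arm for a two-sided test.

    baseline_rate: champion payment rate, e.g. 0.30
    expected_rate: payment rate the challenger is hoped to reach
    alpha:         1 - confidence level (0.05 means 95% confidence)
    power:         chance of detecting the lift if it is real (assumed 80%)
    """
    for rate in (baseline_rate, expected_rate):
        if not 0.0 < rate < 1.0:
            raise ValueError("rates must be strictly between 0 and 1")
    if baseline_rate == expected_rate:
        raise ValueError("rates must differ to size a test")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (baseline_rate + expected_rate) / 2

    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(
            baseline_rate * (1 - baseline_rate)
            + expected_rate * (1 - expected_rate)
        )
    ) ** 2
    return math.ceil(numerator / (expected_rate - baseline_rate) ** 2)
```

Under these assumptions, detecting a ten-point lift from a 30% baseline needs a few hundred accounts per arm, while detecting a three-point lift from the same baseline needs several thousand. The minimum per arm should be recorded in the test design alongside the lift it was sized for.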

Test Window

The test window must be long enough to cover at least one full payment cycle. For monthly instalment credit, a minimum four-week window is required to observe whether the challenger’s treatment assignments produce payment completion outcomes, rather than only contact and response rate differences.

Test windows should avoid spanning month-end salary credit periods unless the challenger is specifically designed to improve performance in that window. The salary credit effect drives payment behaviour independently of the model’s predictive contribution. A test window that captures a month-end will produce payment rate results influenced partly by salary timing, making it difficult to isolate how much of the difference is attributable to the challenger model versus the salary calendar. Where a full payment cycle cannot be observed without including a salary date, both arms should run over identical calendar dates so that the salary effect falls equally on champion and challenger.

Outcome Metrics Defined Before the Test Runs

The primary outcome metric must be defined before the test begins. For a collections propensity model, the primary metric should be payment rate within the test window. Secondary metrics include right-party contact rate, promise-to-pay (PTP) fulfilment rate, cost per recovery, and self-cure rate for accounts the challenger scores above the self-cure suppression threshold.

Selecting the success metric after seeing the results introduces the risk of choosing the metric on which the challenger happened to outperform rather than the metric that reflects genuine business value. SARB model governance documentation requires evidence that the success criteria were established before the test ran. A champion switch recommendation built on a metric selected post-hoc will not withstand scrutiny during a model risk review.

Statistical Validity Threshold

The significance threshold must be set before the test runs. For SA retail bank and credit provider portfolios of moderate size, a 95% confidence threshold is achievable within a standard test window with appropriate population sizing. The challenger’s outperformance on the primary metric must clear this threshold before a switch is recommended.

Directional outperformance that does not reach statistical significance is a common source of premature champion switches. The challenger appears to outperform across three weeks of live testing. The collections team recommends a switch. The governance process is abbreviated because the results look clear. The challenger underperforms in production for reasons the underpowered test population did not capture. The bank now has a production model that is performing worse than the one it replaced, and a governance record that does not explain why the switch was made without adequate statistical validation.
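
The gap between directional and statistically significant outperformance can be checked with a two-proportion z-test on payment counts. This is a minimal sketch of one common test, not a prescribed method. The function names are hypothetical, and the choice of a two-sided p-value combined with a requirement that the challenger is ahead is an assumption the test design would need to state up front.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(paid_champion, n_champion, paid_challenger, n_challenger):
    """Return (z, two_sided_p_value) for challenger minus champion payment rate."""
    rate_champion = paid_champion / n_champion
    rate_challenger = paid_challenger / n_challenger
    pooled = (paid_champion + paid_challenger) / (n_champion + n_challenger)
    standard_error = math.sqrt(
        pooled * (1 - pooled) * (1 / n_champion + 1 / n_challenger)
    )
    if standard_error == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (rate_challenger - rate_champion) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def clears_threshold(z, p_value, alpha=0.05):
    """True only if the challenger is ahead AND the result is significant."""
    return z > 0 and p_value < alpha
```

With illustrative counts, a challenger paying at 36% against a champion at 30% clears a 95% threshold when each arm holds 1,000 accounts, but the same six-point gap on 100 accounts per arm does not. The second case is the directional result that should not trigger a switch.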

Champion Switch Governance

A champion switch is a model change under SARB’s model risk management framework. The documentation required before the switch is implemented includes the test design and population assignment methodology, the outcome metrics and statistical validity criteria that were set before the test began, the full test results with statistical validation, and sign-off from the model risk committee or equivalent governance body.

This documentation must be retained and available for SARB examination. Importantly, documentation from tests where the challenger did not outperform and no switch was made should also be retained. A pattern of unsuccessful challenger tests is governance-relevant information that demonstrates the bank is running a rigorous testing programme rather than promoting challengers whenever results appear directionally positive.

What Can Be Tested as a Challenger

Champion-challenger infrastructure in SA collections covers more ground than full propensity model comparisons. Five test types produce meaningful improvements across different time horizons and governance overhead levels.

Message Variant Testing

Two message versions for the same treatment tier: different tone, call to action, or payment link placement. Outcome metrics: click-through rate and payment completion rate within a defined response window. This test does not constitute a model change. It carries the lowest governance overhead of any challenger test and can be run and concluded within two to three weeks. The results directly inform the message content used in the production treatment without requiring model risk committee sign-off for implementation.

Channel Sequencing Variant

WhatsApp-first versus voice-first as the lead channel for a specific borrower segment and score band. Outcome metrics: right-party contact rate and payment rate within the test window. This test generates borrower-segment-specific evidence for channel strategy decisions rather than applying a general assumption about which channel performs better across the full portfolio. For South African credit providers with diverse borrower bases across income segments and regions, segment-specific channel evidence is more useful than a portfolio-level default.

Score Band Reconfiguration

A five-band score configuration tested against the current four-band configuration to determine whether finer segmentation produces better treatment differentiation and lower cost per recovery. This is a treatment matrix change rather than a model change, and carries moderate governance overhead. The outcome metrics are cost per recovery and payment rate by band, measured across the full test window.

Self-Cure Threshold Adjustment

A higher self-cure suppression threshold tested against the current threshold. If suppressing more accounts from active outreach reduces cost per recovery without materially reducing the overall payment rate, the test produces evidence for a treatment policy update. The outcome metric combines cost per recovery and net payment rate across the suppressed and actively contacted populations combined.
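
The combined metric can be made concrete with a small calculation. In the sketch below, the `ArmOutcome` class and its field names are hypothetical, and cost per recovery is assumed to mean total outreach cost divided by the number of paying accounts across both the suppressed and the contacted populations. A bank's own definition may differ.

```python
from dataclasses import dataclass


@dataclass
class ArmOutcome:
    """Outcomes for one arm of a self-cure threshold test."""

    suppressed_accounts: int   # accounts held back from outreach
    suppressed_paid: int       # of those, accounts that paid (self-cured)
    contacted_accounts: int    # accounts that received active outreach
    contacted_paid: int        # of those, accounts that paid
    outreach_cost: float       # total outreach spend for the arm, in rand

    @property
    def net_payment_rate(self) -> float:
        """Payment rate across suppressed and contacted accounts combined."""
        total = self.suppressed_accounts + self.contacted_accounts
        return (self.suppressed_paid + self.contacted_paid) / total

    @property
    def cost_per_recovery(self) -> float:
        """Outreach spend per paying account (assumed definition)."""
        recovered = self.suppressed_paid + self.contacted_paid
        if recovered == 0:
            raise ValueError("no recoveries in this arm")
        return self.outreach_cost / recovered
```

Comparing both properties across the two arms shows whether a lower cost per recovery was bought at the expense of the net payment rate. Whether any drop in payment rate counts as material is a threshold to fix in the test design, not something the calculation decides.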

Full Propensity Model Replacement

The highest-stakes challenger test: a new propensity model evaluated against the current champion across the relevant DPD bucket. Outcome metrics include payment rate, Gini coefficient improvement over the test window, and cost per recovery. This test carries the highest governance overhead and requires full model change documentation for any champion switch. It should be preceded by lower-overhead tests that validate the challenger’s contact and response rate improvements before committing to a full switch process.
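
For readers less familiar with the Gini coefficient as a model discrimination measure, it is conventionally derived from the area under the ROC curve (AUC) as Gini = 2 × AUC − 1. The sketch below computes it from scores and observed payment outcomes by direct pairwise comparison. The function names are illustrative, and the pairwise approach is chosen for clarity rather than speed on large portfolios.

```python
def auc(scores, paid):
    """Probability that a random payer is scored above a random non-payer.

    scores: model propensity scores, one per account
    paid:   matching outcomes (truthy if the account paid in the window)
    Ties between a payer and a non-payer count as half a win.
    """
    payers = [s for s, outcome in zip(scores, paid) if outcome]
    non_payers = [s for s, outcome in zip(scores, paid) if not outcome]
    if not payers or not non_payers:
        raise ValueError("need at least one payer and one non-payer")
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in payers
        for n in non_payers
    )
    return wins / (len(payers) * len(non_payers))


def gini(scores, paid):
    """Gini coefficient: 0 is no discrimination, 1 is perfect ranking."""
    return 2 * auc(scores, paid) - 1
```

In a live test, each model's Gini would be computed on the accounts it actually scored in its own arm. Because the arms also received different treatments, the comparison is best read alongside payment rate rather than on its own.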

SARB Model Governance Requirements

SARB’s model risk management guidance, aligned with Basel Committee on Banking Supervision principles for model risk, applies to AI collections models and the champion-challenger testing process at South African banks.

Pre-Test Documentation

The test design, population assignment methodology, outcome metrics, and statistical validity criteria must all be documented before the test runs. The documentation is credible under SARB examination only if this condition is met. A governance record that shows the test design was documented after the results were known does not demonstrate that the success criteria were set independently of the outcomes.
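
One way to evidence that the criteria were fixed in advance is to capture the test plan as an immutable record and store a fingerprint of it before the first account is assigned. The sketch below is an illustration of that idea only: the `ChallengerTestPlan` class, its fields, and the use of a SHA-256 hash are assumptions, not a SARB-prescribed format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class ChallengerTestPlan:
    """Pre-test record of the design; frozen so it cannot be edited in place."""

    test_id: str
    primary_metric: str
    secondary_metrics: Tuple[str, ...]
    alpha: float                 # significance threshold, e.g. 0.05
    min_accounts_per_arm: int
    window_days: int
    assignment_method: str

    def fingerprint(self) -> str:
        """SHA-256 of the plan's contents, stable across runs.

        Any later change to a metric or threshold yields a different value.
        """
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()
```

Logging the fingerprint with a timestamp in the governance record before the test starts, and recomputing it when results are presented, gives reviewers a simple check that the success criteria were not adjusted after the outcomes were known.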

Independent Review for Material Tests

Material challenger tests, specifically full propensity model replacements or score band reconfigurations that affect the treatment of a significant portion of the collections portfolio, should be reviewed by the bank’s model validation function before the test is initiated. This review confirms that the test design is sound and that the success criteria are appropriate before any accounts are assigned to the test population.

Champion Switch Sign-Off

Formal governance sign-off from the model risk committee or equivalent body is required before any challenger is deployed as the new production champion. This sign-off should document the committee’s review of the test results, the statistical validation, and any conditions attached to the deployment, such as an enhanced monitoring period following the switch.

Vendor Model Governance

Where the champion or challenger is a vendor-supplied model, the bank retains full model validation and governance responsibility. SARB holds the institution accountable, not the vendor. The bank must own the test design, the outcome measurement, the statistical validation, and the governance sign-off process. A vendor that provides testing infrastructure or outcome reporting is supporting the bank’s process. The bank cannot delegate its regulatory obligations to the vendor.

Documentation Retention

All champion-challenger test documentation must be retained and available for SARB examination, including the documentation from tests where the challenger did not outperform and no switch was recommended. The testing programme’s full history, including unsuccessful tests, is part of the governance evidence that the bank is managing its collections model risk actively and rigorously.

A Better Model in Backtesting Is a Hypothesis. A Champion-Challenger Test Makes It Evidence.

Backtest outperformance is the starting point for a model improvement decision, not the ending point. It tells the data science team that the challenger is worth testing. It does not tell the risk committee that the challenger will outperform in the current SA portfolio environment, under current macroeconomic conditions, on the live borrower population that the production system is scoring today.

Champion-challenger testing converts the hypothesis into evidence. It does so without replacing the production model during the test, without applying an untested model to the full portfolio, and without creating a governance gap that surfaces during a SARB examination.

For South African banks and credit providers managing collections portfolios through a period of continued macroeconomic pressure, post-pandemic borrower behaviour adjustment, and an evolving regulatory framework, the discipline of ongoing challenger testing is what keeps the collections AI system improving rather than drifting.

Five markers of a well-run champion-challenger programme for SA credit providers:

  • Random account-level population assignment with no predictor variable used in the split, documented before the test begins
  • Test window covering at least one full payment cycle, avoiding month-end salary periods unless specifically testing for that window
  • Primary and secondary outcome metrics defined before the test runs, not selected after results are reviewed
  • Statistical significance threshold set before the test begins, with a documented minimum population size per arm
  • Full SARB model change governance documentation retained for every champion switch, and test records retained for tests where no switch was made

iTuring’s AI collections platform includes built-in champion-challenger testing infrastructure for SA credit providers, with configurable population splits, automated outcome tracking against pre-defined metrics, and native documentation output for SARB model change governance.