Cross-validation and honest selection

How honestml splits your data and how the winner is chosen. Every python block on this page is self-contained: copy any one of them and it runs as-is — and every block is executed on each CI run, so the examples cannot rot. The examples use the lightweight models=("baseline", "linear") so they finish in seconds; everything shown applies unchanged to the boosting models.

Picking a CV scheme

cv accepts an integer (the number of folds of the task's default scheme) or a CVConfig. The default scheme="auto" resolves to stratified k-fold for classification and plain k-fold for regression; "holdout" is a single shuffled split. The full menu is "kfold" / "stratified" (i.i.d. shuffled folds), "group" (repeated entities), "holdout" (one split), and the time-ordered "timeseries" (row windows) and "timeseries_period" (calendar / Δt windows) — each covered below. An unimplemented scheme or invalid parameter fails fast at fit, never silently falls back.

from sklearn.datasets import make_regression

from honestml import AutoML, CVConfig

X, y = make_regression(n_samples=200, n_features=6, noise=0.3, random_state=0)

model = AutoML(
    task="regression",
    models=("baseline", "linear"),
    cv=CVConfig(scheme="kfold", n_splits=3),
    random_state=0,
).fit(X, y)

print(model.best_model_id_, model.leaderboard_)

Group-aware CV

With scheme="group", rows that share a group label never span train and test — the leakage guard for repeated entities (a customer with many rows, a patient with many visits). Group labels are row-aligned metadata passed to fit, not a feature; classification uses the stratified-group variant automatically. If a groups= column is present but the scheme is not group-aware, honestml warns about the leakage risk instead of silently accepting it.

import numpy as np
from sklearn.datasets import make_classification

from honestml import AutoML, CVConfig

X, y = make_classification(n_samples=240, n_features=8, n_informative=5, random_state=0)
groups = np.arange(240) // 4  # 60 entities, 4 rows each

model = AutoML(
    task="binary",
    models=("baseline", "linear"),
    cv=CVConfig(scheme="group", n_splits=3),
    random_state=0,
).fit(X, y, groups=groups)

print(model.best_model_id_)

Time-series CV with purge and embargo

scheme="timeseries" orders rows by the value of the time= column and scores on expanding-window folds — train always precedes test. Sizes are in rows of the time-ordered data: n_test is the test-window size per fold, purge drops rows right before each test window and embargo skips rows right after earlier test windows, so overlapping or delayed labels cannot leak across the split. When labels mature over an interval, pass label_time= (the label end time) for the full purge. A shuffling scheme over data that has a time axis triggers a look-ahead warning.

import numpy as np
from sklearn.datasets import make_classification

from honestml import AutoML, CVConfig

X, y = make_classification(n_samples=240, n_features=8, n_informative=5, random_state=0)
time = np.arange(240)  # any orderable axis: ints, timestamps, dates

model = AutoML(
    task="binary",
    models=("baseline", "linear"),
    cv=CVConfig(scheme="timeseries", n_splits=3, n_test=40, purge=5, embargo=5),
    random_state=0,
).fit(X, y, time=time)

print(model.best_model_id_)

Calendar- and Δt-period folds

scheme="timeseries_period" walks forward over periods instead of rows: each fold tests a block of whole periods and trains on all strictly earlier ones (expanding). Set period to "month", "week" (ISO, Monday-anchored) or "day" for a datetime time= axis, or "delta" with a period_size (the window width) for a numeric axis. With period folds the integer knobs count periods, not rows: n_test is the test width in periods, step_periods is the walk-forward step (defaults to n_test, i.e. adjacent tiles) and purge/embargo are period gaps (the early-stopping tail n_es is the one exception — it always counts rows). Empty periods never produce a fold, and the resolved period counts land in run_report_["cv"].

import numpy as np
from sklearn.datasets import make_classification

from honestml import AutoML, CVConfig

X, y = make_classification(n_samples=360, n_features=8, n_informative=5, random_state=0)
time = np.arange("2021-01-01", "2022-01-01", dtype="datetime64[D]")[:360]  # ~12 months, daily

model = AutoML(
    task="binary",
    models=("baseline", "linear"),
    cv=CVConfig(scheme="timeseries_period", period="month", n_splits=3, n_test=2),
    random_state=0,
).fit(X, y, time=time)

print(model.best_model_id_, model.run_report_["cv"])

A "train 5 / test 2 months" recipe is n_test=2 plus max_train_periods=5 (the rolling cap below); a numeric axis binned into fixed windows is period="delta", period_size=....

Wall-clock (Δt) gaps and rolling windows

On irregular axes (markets close at night and on weekends) a gap counted in rows spans a different real duration each time. purge_delta and embargo_delta instead measure the gap by time value — in the units the time= axis stores (for a datetime axis, its storage unit). They apply to both "timeseries" and "timeseries_period" and are mutually exclusive with the integer purge/embargo on the same axis (set one or the other, never both).

import numpy as np
from sklearn.datasets import make_classification

from honestml import AutoML, CVConfig

X, y = make_classification(n_samples=240, n_features=8, n_informative=5, random_state=0)
time = np.arange(240.0)  # a numeric time axis

model = AutoML(
    task="binary",
    models=("baseline", "linear"),
    cv=CVConfig(scheme="timeseries", n_splits=3, n_test=40, purge_delta=5.0, embargo_delta=5.0),
    random_state=0,
).fit(X, y, time=time)

print(model.best_model_id_)

By default the train window expands — every fold trains on all earlier data. For non-stationary regimes cap the lookback: max_train_size keeps only the last N rows ("timeseries"), max_train_periods the last N periods ("timeseries_period"); leaving them unset keeps the expanding window.

import numpy as np
from sklearn.datasets import make_classification

from honestml import AutoML, CVConfig

X, y = make_classification(n_samples=240, n_features=8, n_informative=5, random_state=0)
time = np.arange(240)

model = AutoML(
    task="binary",
    models=("baseline", "linear"),
    cv=CVConfig(scheme="timeseries", n_splits=3, n_test=40, max_train_size=80),
    random_state=0,
).fit(X, y, time=time)

print(model.best_model_id_)

Weighting unequal periods

By default the leaderboard score is pooled — one metric over all out-of-fold rows, so a month with more rows weighs more. With weighting="period" the score becomes the macro-average over periods (each period counts equally), and the significance band aggregates the bootstrap by period to match. It needs a time-ordered scheme and, under the default band, at least four periods with a defined metric (fewer fails fast). A period whose metric is undefined — e.g. a single-class month for ROC AUC — is dropped, and run_report_["cv"] reports the weighting mode plus n_periods_used.

import numpy as np
from sklearn.datasets import make_classification

from honestml import AutoML, CVConfig

X, y = make_classification(n_samples=360, n_features=8, n_informative=5, random_state=0)
time = np.arange("2021-01-01", "2022-01-01", dtype="datetime64[D]")[:360]

model = AutoML(
    task="binary",
    models=("baseline", "linear"),
    cv=CVConfig(scheme="timeseries_period", period="month", n_splits=8, n_test=1, weighting="period"),
    random_state=0,
).fit(X, y, time=time)

cv = model.run_report_["cv"]
print(model.best_model_id_, cv["weighting"], cv["n_periods_used"])

The equivalence band

All candidates are scored out-of-fold on one shared split, then a seeded paired bootstrap builds an equivalence band: the set of candidates statistically indistinguishable from the top scorer. The simplest band member ships — a more complex model has to prove it is significantly better, not just luckier — and a tie is disclosed, never hidden:

band_member_ids_ — who was in the band,
band_width_ / band_unstable_ — how wide and how stable it was,
winner_by_tiebreak_ — whether the winner needed the simplicity tiebreak.

significance="off" disables the band and returns a pure argmax.

from sklearn.datasets import make_classification

from honestml import AutoML

X, y = make_classification(n_samples=200, n_features=8, n_informative=5, random_state=0)

model = AutoML(task="binary", models=("baseline", "linear"), random_state=0).fit(X, y)

print(model.best_model_id_)
print(model.band_member_ids_, model.band_width_, model.winner_by_tiebreak_)

Outer holdout and finalize

outer_holdout carves a fraction of the data once, before anything else runs. Selection, tuning and calibration see only the remaining DEV part; the winner is scored on the holdout exactly once and the score lands in holdout_score_. With finalize=True (the default) the shipped model is then refit on all data — after scoring, so the reported number stays a conservative estimate for the model you deploy. The carve is scheme-aware (stratified for classification, tail-of-time for time series).

from sklearn.datasets import make_classification

from honestml import AutoML, CVConfig

X, y = make_classification(n_samples=400, n_features=8, n_informative=5, random_state=0)

model = AutoML(
    task="binary",
    models=("baseline", "linear"),
    cv=CVConfig(outer_holdout=0.25),
    random_state=0,
).fit(X, y)

print(model.best_model_id_, round(model.holdout_score_, 3))

Probability calibration

Classification only, opt-in: calibrate="sigmoid", "isotonic" or "auto". The calibrator is fit on out-of-fold predictions and is applied only when a cross-fitted check shows it actually improves the proper loss — otherwise the winner ships uncalibrated. calibration_ reports what happened.

from sklearn.datasets import make_classification

from honestml import AutoML, CVConfig

X, y = make_classification(n_samples=300, n_features=8, n_informative=5, random_state=0)

model = AutoML(
    task="binary",
    models=("baseline", "linear"),
    cv=CVConfig(calibrate="sigmoid"),
    random_state=0,
).fit(X, y)

print(model.calibration_)
proba = model.predict_proba(X)
print(proba.shape)

Everything on this page is recorded in the run report (run_report_): the resolved CV config, the band, the holdout score and the calibration outcome — see the quickstart for reports and artifacts.