Hyperparameter tuning and ensembling
This page describes how honestml treats model hyperparameters and when it blends models. Every
Python block on this page is self-contained: copy any one of them and it
runs as-is. Every block is also executed on each CI run, so the examples cannot
rot. Both tuning and ensembling are opt-in; the default run trains every
candidate with fixed hyperparameters and ships a single model.
Fixed, conservative defaults
By default no hyperparameter depends on your data. When the model zoo includes a
boosting model, each boosting fit uses early stopping (ADR-0080): the n_es tail
carved from every fold's training rows is held out as a validation set, the tree count
is raised to a generous ceiling, and the library stops adding trees once the validation
score stops improving. The artifact manifest records the actual early_stopping flag so the choice travels with
the model. A boosting fit without a validation tail (inner-CV HPO, or a fold too small
to spare one) falls back to a fixed n_estimators=300 (iterations for CatBoost) and
logs a WARNING that the comparison may favor overfit settings. The linear and baseline
models are plain sklearn defaults. There is no data-size-driven adaptation of any kind —
what you see in the leaderboard is the untuned, reproducible baseline of each library.
from sklearn.datasets import make_classification
from honestml import AutoML
X, y = make_classification(n_samples=240, n_features=8, n_informative=5, random_state=0)
model = AutoML(task="binary", models=("baseline", "lightgbm"), cv=3, random_state=0).fit(X, y)
print(model.best_model_id_, model.leaderboard_)
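The early-stopping rule described above can be illustrated without any boosting library. This is a minimal sketch, not honestml's implementation: honestml delegates the stopping decision to the boosting library itself. The helper names `split_tail` and `best_round`, the `patience` parameter, and the validation curve are all made up for illustration.

```python
def split_tail(rows, n_es):
    """Hold out the last n_es rows as the early-stopping validation set."""
    return rows[:-n_es], rows[-n_es:]


def best_round(val_losses, patience):
    """Return the 1-based round with the lowest validation loss.

    The scan stops once `patience` consecutive rounds have passed
    without improving on the best loss seen so far.
    """
    best_loss, best_i = float("inf"), 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_i = loss, i
        elif i - best_i >= patience:
            break
    return best_i


# 100 training rows in a fold, with a hypothetical tail of 20 rows.
fit_rows, val_rows = split_tail(list(range(100)), n_es=20)

# Made-up validation curve: improves for four rounds, then degrades.
curve = [0.70, 0.61, 0.55, 0.52, 0.53, 0.54, 0.56, 0.58]
chosen = best_round(curve, patience=3)
print(len(fit_rows), len(val_rows), chosen)  # 80 20 4
```

The ceiling only needs to be high enough that the stopping rule, not the ceiling, ends the fit; in the sketch the curve has eight rounds and the rule settles on round 4.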
Tuning is opt-in: HPOConfig
Passing hpo=HPOConfig(...) tunes each tunable model type on an inner CV of
the development data, before the outer honest selection — the tuned candidate
then competes in the regular leaderboard like any other, so tuning can never
bypass the selection guarantees. backend="optuna" is the only backend and
needs the optuna extra (pip install "honestml[optuna]"); n_trials is the
per-model search budget, inner_cv the fold count of the tuning objective, and
models=None tunes every selected type that declares a search space (baseline
and linear declare none). keep_baseline=True keeps the untuned factory in the
leaderboard next to the tuned one; timeout_s caps each model's search
wall-clock time but makes the result non-deterministic, which is disclosed in the run
report.
from sklearn.datasets import make_classification
from honestml import AutoML, HPOConfig
X, y = make_classification(n_samples=240, n_features=8, n_informative=5, random_state=0)
model = AutoML(
    task="binary",
    models=("baseline", "lightgbm"),
    cv=3,
    hpo=HPOConfig(n_trials=3, inner_cv=2),
    random_state=0,
).fit(X, y)
tuned = model.run_report_["hpo"]["tuned"]["lightgbm"]
print(tuned["chosen_params"])
print(round(tuned["inner_best_score"], 3), tuned["n_trials_run"])
What the search explores and what is disclosed
Each model type declares its own search space; the tuner samples only those keys, and a tuned tree count overrides the fixed 300:
- CatBoost: depth 4–10, learning_rate 0.01–0.3 (log), iterations 50–500 (step 50), l2_leaf_reg 1–10 (log), subsample 0.6–1.0, one_hot_max_size 2–64 (the one-hot↔target-statistics boundary for categoricals).
- LightGBM: max_depth 3–10, learning_rate 0.01–0.3 (log), n_estimators 50–500 (step 50), reg_lambda 0–10, subsample 0.6–1.0, colsample_bytree 0.5–1.0, plus the categorical-split regularizers min_data_per_group 10–300 and cat_smooth 1.0–50.0.
- XGBoost: max_depth 3–10, learning_rate 0.01–0.3 (log), n_estimators 50–500 (step 50), reg_lambda 0–10, subsample 0.6–1.0, colsample_bytree 0.5–1.0.
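To make the three kinds of range concrete (stepped integer, log-scaled float, plain float), here is a small sketch that draws one trial from the XGBoost ranges listed above using only the standard library. It is an illustration of what "sampling only those keys" means, not honestml's tuner and not Optuna's sampler; the `SPACE` layout and the `sample` helper are hypothetical.

```python
import math
import random

# Ranges copied from the XGBoost entry above; the tuple layout is hypothetical.
SPACE = {
    "max_depth": ("int", 3, 10, 1),
    "learning_rate": ("log", 0.01, 0.3),
    "n_estimators": ("int", 50, 500, 50),
    "reg_lambda": ("float", 0.0, 10.0),
    "subsample": ("float", 0.6, 1.0),
    "colsample_bytree": ("float", 0.5, 1.0),
}


def sample(space, rng):
    """Draw one parameter set; only keys declared in `space` are produced."""
    params = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "int":
            _, low, high, step = spec
            # Inclusive upper bound, restricted to multiples of `step` from `low`.
            params[name] = rng.randrange(low, high + 1, step)
        elif kind == "log":
            _, low, high = spec
            # Uniform in log space, so 0.01-0.03 is as likely as 0.1-0.3.
            params[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        else:
            _, low, high = spec
            params[name] = rng.uniform(low, high)
    return params


trial = sample(SPACE, random.Random(0))
print(trial)
```

A log scale suits learning_rate because its useful values span more than an order of magnitude; the step of 50 on the tree count keeps the search from spending trials on near-identical settings.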
The whole tuning story is disclosed in run_report_["hpo"]: the per-model
chosen_params, inner score and trial count, the cost estimate
(n_trials × inner_cv fits per tuned model), and the deterministic flag
(False when timeout_s is set). It also carries the honesty flags: the selection
OOF is computed post-tuning, and tuning runs on the full feature width even when
feature selection is also enabled.
from sklearn.datasets import make_classification
from honestml import AutoML, HPOConfig
X, y = make_classification(n_samples=200, n_features=8, n_informative=5, random_state=0)
model = AutoML(
    task="binary",
    models=("lightgbm",),
    cv=3,
    hpo=HPOConfig(n_trials=3, inner_cv=2),
    random_state=0,
).fit(X, y)
hpo = model.run_report_["hpo"]
print(hpo["deterministic"], hpo["cost_estimate_fits"], hpo["selection_oof_is_post_tuning"])
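The cost formula quoted above is simple enough to check by hand before launching a run. The helper below is a hypothetical convenience, not part of honestml; it assumes the per-model cost simply adds up across tuned models, which this page does not state explicitly.

```python
def hpo_cost_fits(n_trials, inner_cv, n_tuned_models=1):
    """Inner-CV fits spent on tuning: n_trials x inner_cv per tuned model.

    Summing across models is an assumption made for this sketch.
    """
    return n_trials * inner_cv * n_tuned_models


small = hpo_cost_fits(n_trials=3, inner_cv=2)                     # the example above
large = hpo_cost_fits(n_trials=50, inner_cv=5, n_tuned_models=3)  # a larger budget
print(small, large)  # 6 750
```

These fits come on top of the outer selection folds, so the tuning budget is usually the dominant cost of a run with hpo enabled.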
Ensembling with a significance gate
ensemble=EnsembleConfig() blends the leaderboard candidates after the
honest selection, over their out-of-fold predictions — no extra refitting is
needed to evaluate the recipe. method="caruana" (the default) is greedy
ensemble selection with replacement plus seeded bagging (size caps the steps,
n_bags the bagging subsamples); method="weighted" solves for simplex
weights directly; metric=None blends on the run metric. The blend ships only
if it is significantly better than the best single model — the same
bootstrap significance gate selection uses — otherwise the single winner ships,
and the decision is always disclosed in run_report_["ensemble"]: applied,
the member_ids and weights, the OOF improvement oof_delta, and a
gate_reason such as significant_improvement, equivalent_to_best,
worse_than_best or degenerate_recipe (the weight search collapsed onto a
single member).
from sklearn.datasets import make_classification
from honestml import AutoML, EnsembleConfig
X, y = make_classification(n_samples=240, n_features=8, n_informative=5, random_state=0)
model = AutoML(
    task="binary",
    models=("baseline", "linear"),
    ensemble=EnsembleConfig(),
    random_state=0,
).fit(X, y)
ens = model.run_report_["ensemble"]
print(ens["applied"], ens["gate_reason"])
print(ens["member_ids"], ens["weights"])
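For readers unfamiliar with greedy ensemble selection with replacement, the core loop can be sketched in a few lines over out-of-fold predictions. This is a simplified illustration, not honestml's implementation: it omits the seeded bagging and the significance gate, uses mean squared error as a stand-in metric, and the model ids and predictions are made up.

```python
from collections import Counter


def mse(y_true, preds):
    """Mean squared error between labels and predicted probabilities."""
    return sum((p - t) ** 2 for p, t in zip(preds, y_true)) / len(y_true)


def caruana_weights(oof, y_true, size):
    """Greedy ensemble selection with replacement over OOF predictions.

    At each of `size` steps, add the model whose inclusion gives the lowest
    loss for the running average. A model may be picked more than once, and
    its final weight is its share of the picks.
    """
    picks = []
    total = [0.0] * len(y_true)
    for _ in range(size):
        best_id, best_loss = None, float("inf")
        for model_id, preds in oof.items():
            blend = [(s + p) / (len(picks) + 1) for s, p in zip(total, preds)]
            loss = mse(y_true, blend)
            if loss < best_loss:
                best_id, best_loss = model_id, loss
        picks.append(best_id)
        total = [s + p for s, p in zip(total, oof[best_id])]
    return {m: c / size for m, c in Counter(picks).items()}


# Made-up labels and OOF probabilities for three hypothetical candidates.
y = [0, 1, 1, 0, 1, 0]
oof = {
    "a": [0.2, 0.7, 0.6, 0.4, 0.9, 0.1],
    "b": [0.4, 0.9, 0.5, 0.1, 0.6, 0.3],
    "c": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}
weights = caruana_weights(oof, y, size=10)
print(weights)
```

Because the loop reuses predictions that already exist, evaluating a recipe costs no model fits, which is why the blend can be scored on the same OOF rows as the single models.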
Tuned candidates and the ensemble decision both flow through the same honest
machinery described in
cross-validation and honest selection, and every choice is
recorded in run_report_.