API reference
Everything below is importable from the top-level honestml package. Heavy training
dependencies are imported lazily, so import honestml stays fast — loading an artifact
for serving never executes the training stack.
Facade
After fit, the estimator exposes best_model_id_ (the honest winner), leaderboard_
(absolute OOF scores), fitted_ (the FittedModel serving handle for save_artifact)
and run_report_ (the JSON-serializable run report).
Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin
Fit a small leaderboard for a tabular task and expose the winner.
fit(X, y, sample_weight=None, groups=None, time=None, label_time=None)
Fit the leaderboard and expose the winner.
groups (per-row group labels) enables group-aware CV with
cv=CVConfig(scheme="group"): rows of the same group never span
train and test. time declares the CV time axis for cv=CVConfig(scheme="timeseries")
(purge/embargo, value-based order); label_time is the optional label-end-time
t1 for full de Prado purge. All are row-aligned metadata like sample_weight — not
features, not needed at predict time.
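The group guarantee described above (rows of the same group never span train and test) can be illustrated without the library. This is a minimal plain-Python sketch; `group_kfold` is a hypothetical helper written for this example, not honestml's splitter, and the round-robin fold assignment is an assumption.

```python
from collections import defaultdict


def group_kfold(groups, n_splits):
    """Yield (train_idx, test_idx) so that no group appears on both sides.

    Hypothetical helper for illustration only; not honestml's splitter.
    """
    rows_by_group = defaultdict(list)
    for row, group in enumerate(groups):
        rows_by_group[group].append(row)

    # Assign whole groups to folds round-robin, in sorted order for determinism.
    folds = [[] for _ in range(n_splits)]
    for i, group in enumerate(sorted(rows_by_group)):
        folds[i % n_splits].extend(rows_by_group[group])

    all_rows = set(range(len(groups)))
    for test_rows in folds:
        yield sorted(all_rows - set(test_rows)), sorted(test_rows)


groups = ["a", "a", "b", "c", "c", "c", "d", "b"]
splits = list(group_kfold(groups, n_splits=2))
for train_idx, test_idx in splits:
    train_groups = {groups[i] for i in train_idx}
    test_groups = {groups[i] for i in test_idx}
    # The defining property of group-aware CV: the two sides share no group.
    assert train_groups.isdisjoint(test_groups)
```

The groups list is row-aligned with X and y, which is why it travels next to sample_weight rather than as a feature column.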
predict(X)
predict_proba(X)
score(X, y, sample_weight=None)
Metric score, sklearn convention (higher is better).
A lower-is-better metric (e.g. log_loss) is sign-flipped so grid-search
and Pipeline maximize it; leaderboard_ carries the raw, unflipped
value.
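The sign convention can be sketched in a few lines. `sklearn_style_score` is a hypothetical name used only for this illustration, and the numbers are made up.

```python
def sklearn_style_score(raw_value, greater_is_better):
    """Return a value where higher is always better (sklearn convention)."""
    return raw_value if greater_is_better else -raw_value


raw_log_loss_a = 0.35  # lower is better
raw_log_loss_b = 0.50
raw_accuracy = 0.91    # higher is better

score_a = sklearn_style_score(raw_log_loss_a, greater_is_better=False)
score_b = sklearn_style_score(raw_log_loss_b, greater_is_better=False)
score_acc = sklearn_style_score(raw_accuracy, greater_is_better=True)

# After the flip, the model with the lower loss has the higher score,
# so a maximizing search picks it.
best = max([("a", score_a), ("b", score_b)], key=lambda item: item[1])[0]
```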
available_models(task=None)
staticmethod
Discoverable models (built-in + plugins) and their capabilities.
Read-only and lazy: reads descriptors without materializing any adapter, so a boosting plugin is listed even when its extra is not installed.
Artifacts and serving
save_artifact(model.fitted_, path) writes the fitted handle that AutoML.fit exposes
as the fitted_ attribute; load_artifact reads it back as a FittedModel, the
lightweight serving handle.
Serialize model to a versioned artifact directory.
Writes the data files first, then a checksums block (sha256 of every file plus a digest of
the manifest payload) so load_artifact can verify integrity before deserializing the model
body. sign is an optional hook: it receives the manifest digest (hex) and returns a
signature string written to signature for an authenticated verify= on load.
model_format picks the body serializer: "joblib" (the default) or "native" —
a boosting body goes through the library's stable format (xgb ubj / cat cbm / lgbm text)
instead of pickle; anything without a native format (sklearn models, a shipped ensemble)
transparently stays joblib.
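The checksums step (sha256 of every file plus a digest of the manifest payload) might look roughly like the sketch below. The dictionary layout, the canonical JSON dump, and the file name `model.bin` are assumptions made for this example; the real on-disk format is not shown in this reference.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_file(path):
    """Hex sha256 of a file, read in chunks to keep memory flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_checksums(directory, manifest_payload):
    """Hash every file in the directory, then hash a canonical manifest dump.

    The keys "files" and "manifest_digest" are assumptions for this sketch.
    """
    directory = Path(directory)
    files = {p.name: sha256_file(p) for p in sorted(directory.iterdir()) if p.is_file()}
    canonical = json.dumps(manifest_payload, sort_keys=True, separators=(",", ":"))
    manifest_digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {"files": files, "manifest_digest": manifest_digest}


manifest = {"model_type": "demo", "version": 1}
with tempfile.TemporaryDirectory() as tmp:
    body = Path(tmp) / "model.bin"
    body.write_bytes(b"model-bytes")
    checksums = build_checksums(tmp, manifest)
    # Changing the file changes its digest; the manifest digest is unaffected.
    body.write_bytes(b"tampered")
    tampered = build_checksums(tmp, manifest)
```

Writing the data files first and the checksums last means the checksums always describe what is actually on disk.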
Load an artifact directory into a FittedModel.
Order: read manifest -> version-gate -> verify integrity -> model_type dispatch +
deserialize. require_integrity makes a missing checksums block an error (older
artifacts warn by default); verify is an optional signature hook
(signature, manifest_digest) -> bool.
SECURITY: a joblib body and calibrator.joblib are deserialized via joblib/pickle (a native
boosting body is a structural file instead). The sha256 integrity check detects corruption
and naive substitution, NOT authenticity — a malicious author can embed code with a matching
digest; use verify (a signature) and load only from a trusted source. The version-gate is
compatibility-only, not a trust check.
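A pair of hooks matching the shapes described above (sign takes the hex manifest digest and returns a string; verify takes (signature, manifest_digest) and returns a bool) could be built on HMAC. This is a sketch under assumptions: `make_hooks` is a hypothetical helper, the key is a placeholder, and HMAC is one possible choice rather than the library's prescribed scheme.

```python
import hashlib
import hmac


def make_hooks(secret_key: bytes):
    """Build a (sign, verify) pair with the hook shapes described above."""

    def sign(manifest_digest: str) -> str:
        mac = hmac.new(secret_key, manifest_digest.encode("ascii"), hashlib.sha256)
        return mac.hexdigest()

    def verify(signature: str, manifest_digest: str) -> bool:
        expected = sign(manifest_digest)
        # compare_digest avoids leaking the mismatch position through timing.
        return hmac.compare_digest(signature, expected)

    return sign, verify


# Placeholder key: in practice it would come from a secret store.
sign, verify = make_hooks(b"demo-key")
digest = hashlib.sha256(b"manifest payload").hexdigest()
signature = sign(digest)

other_digest = hashlib.sha256(b"different payload").hexdigest()
_, verify_with_other_key = make_hooks(b"another-key")
```

HMAC is symmetric, so anyone who can verify can also sign; an asymmetric signature is the better fit when the loading side must not be able to produce artifacts.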
A fitted model with its preprocessing schema — the unified inference path.
classes is the global class order for classification and None for regression,
so the inference path is kind-aware: multiclass proba is aligned to it and
a regression model has no probabilities.
predict(X)
predict_proba(X)
score(X, y, sample_weight=None)
Export model to a standalone ONNX bundle in directory; returns the parity report.
sample (raw rows, anything the model can predict on) is REQUIRED: the model retains no
training matrix, and without data the honesty gate cannot run — there is no
silent skip. The gate compares the converted graph (float32, onnxruntime) against the
native estimator's RAW output and raises SchemaValidationError on a breach;
a benign near-tie label flip (top-2 gap within the float32 noise band) is downgraded to a
WARNING and recorded in onnx_manifest.json. Requires the onnx extra.
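The near-tie rule can be sketched for a single row of class probabilities. Only the label-flip part of the gate is shown; `classify_label_flip` and the `noise_band` value are assumptions for this illustration, not the library's function or threshold.

```python
def classify_label_flip(native_proba, converted_proba, noise_band=1e-6):
    """Return "ok", "warning" or "breach" for one row of class probabilities.

    noise_band is an illustrative value, not the library's threshold.
    """
    native_label = max(range(len(native_proba)), key=native_proba.__getitem__)
    converted_label = max(range(len(converted_proba)), key=converted_proba.__getitem__)
    if native_label == converted_label:
        return "ok"
    # Labels differ: benign only if the native top-2 classes were nearly tied.
    top1, top2 = sorted(native_proba, reverse=True)[:2]
    return "warning" if (top1 - top2) <= noise_band else "breach"


near_tie = classify_label_flip([0.5000001, 0.4999999], [0.4999999, 0.5000001])
real_flip = classify_label_flip([0.9, 0.1], [0.1, 0.9])
same_label = classify_label_flip([0.2, 0.8], [0.3, 0.7])
```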
Run report
save_run_report writes the run_report_ mapping produced by AutoML.fit as JSON;
render_report turns it into markdown or self-contained HTML.
Write report as indented UTF-8 JSON, returning the written file path.
If path is an existing directory, the report is written to
path/run_report.json; otherwise path is the file itself. With
overwrite=False an existing target raises FileExistsError.
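The path rules above can be sketched with the standard library. `write_report` is a hypothetical stand-in for this example, and its `overwrite=True` default is an assumption, since this reference does not state the default.

```python
import json
import tempfile
from pathlib import Path


def write_report(report, path, overwrite=True):
    """Write report as indented UTF-8 JSON and return the written path.

    The overwrite default is an assumption for this sketch.
    """
    target = Path(path)
    if target.is_dir():
        # An existing directory means "use the conventional file name inside it".
        target = target / "run_report.json"
    if target.exists() and not overwrite:
        raise FileExistsError(target)
    target.write_text(json.dumps(report, indent=2, ensure_ascii=False), encoding="utf-8")
    return target


with tempfile.TemporaryDirectory() as tmp:
    written = write_report({"best_model_id": "demo"}, tmp)
    written_name = written.name
    roundtrip = json.loads(written.read_text(encoding="utf-8"))
    try:
        write_report({}, tmp, overwrite=False)
        refused = False
    except FileExistsError:
        refused = True
```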
Render the run report as markdown or self-contained HTML.
report is the run_report_ mapping or a path to a saved run_report.json
(round-trip with save_run_report). fmt="md" needs nothing beyond the
stdlib; fmt="html" embeds matplotlib charts as base64 PNG when the report
extra is installed and degrades gracefully (WARNING, no charts) when it is not.
If path is an existing directory the file is path/run_report.<fmt>.
Configuration
RunConfig is the resolved run configuration that AutoML.fit records in the run
manifest (run_report_["config"]). You configure AutoML through its constructor
arguments, which accept the section classes below directly: cv=CVConfig(...),
budget=BudgetConfig(...), feature_engineering=FEConfig(...),
feature_selection=FeatureSelectionConfig(...), hpo=HPOConfig(...),
ensemble=EnsembleConfig(...). TrackerConfig stands apart: it configures the
experiment tracker passed through the tracker argument of AutoML.
Bases: pydantic.main.BaseModel
Top-level run configuration; serializable basis of the run manifest.
model_config = {'extra': 'forbid', 'frozen': True}
class-attribute
parse(data)
classmethod
Validate untrusted input, raising ConfigError on failure.
Bases: pydantic.main.BaseModel
Cross-validation scheme and its parameters.
scheme="auto" resolves to Task.default_cv_scheme at composition time;
unimplemented schemes/params fail fast there, never silently.
model_config = {'extra': 'forbid', 'frozen': True}
class-attribute
Bases: pydantic.main.BaseModel
Run budget: "none" (unbounded, default), wall-clock "time" or "trials".
model_config = {'extra': 'forbid', 'frozen': True}
class-attribute
Bases: pydantic.main.BaseModel
Feature-engineering catalog toggles; all transformers default off.
A fixed, configurable catalog (not a plugin port). datetime deltas are a separate per-row
axis driven by Task.report_date, NOT part of this config. Target-encoding is
binary-classification-only; multiclass/regression gracefully skip it.
model_config = {'extra': 'forbid', 'frozen': True}
class-attribute
Bases: pydantic.main.BaseModel
Feature-selection catalog; opt-in, default OFF via fs=None.
Ranker strategies importance/random_probe/null_importance/shap (lazy shap
extra) plus the wrapper sequential (FeatureSubsetSelector port). compare runs
several strategies and picks one subset-winner; compare=None is the single-strategy path.
arbitration chooses the locus: "holdout" (a DEV-internal selection-holdout) or
"nested" (K-fold on DEV; timeseries = expanding-window) with an honest significance winner.
Anti-leakage OOF ranking/scoring lives in the application; the winning subset serializes into
FeatureSchema. cutoff applies only to ranker strategies — sequential returns its own
subset (seq_*). null_importance works on every scheme: i.i.d. schemes permute uniformly,
timeseries/group permute the target WITHIN structure blocks of null_block_size
rows / per group. Per-strategy randomness is isolated via a stable seed hash.
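Two ideas from the description above, a stable per-strategy seed and a target permutation confined to blocks of rows, can be sketched together. Both helpers are hypothetical; the hashing recipe and the block layout are assumptions, not the library's implementation.

```python
import hashlib
import random


def strategy_seed(base_seed, strategy):
    """Stable per-strategy seed.

    sha256 is used because the builtin hash() of a str varies between processes.
    """
    payload = f"{base_seed}:{strategy}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")


def permute_within_blocks(target, block_size, seed):
    """Shuffle target values inside consecutive blocks of block_size rows."""
    rng = random.Random(seed)
    permuted = []
    for start in range(0, len(target), block_size):
        block = list(target[start:start + block_size])
        rng.shuffle(block)  # values never leave their block
        permuted.extend(block)
    return permuted


y = [0, 0, 1, 1, 1, 0, 1, 0, 0]
seed = strategy_seed(42, "null_importance")
y_null = permute_within_blocks(y, block_size=3, seed=seed)
y_null_again = permute_within_blocks(y, block_size=3, seed=seed)
```

Keeping the shuffle inside each block preserves the coarse time (or group) structure of the target while still breaking its link to the features.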
model_config = {'extra': 'forbid', 'frozen': True}
class-attribute
Bases: pydantic.main.BaseModel
Hyperparameter-optimization catalog; opt-in, default OFF via hpo=None.
When set, composition tunes each tunable model type on an inner-CV of DEV (before the outer
honest selection): the tuned factory replaces (or, with keep_baseline, augments) the
baseline in the leaderboard. n_trials is the per-model search budget (distinct from
BudgetConfig.n_trials, the run candidate-loop); inner_cv is the inner fold count of the
tuning objective. timeout_s (per-model wall-clock cap) makes the search non-deterministic —
surfaced in the run-report. models=None tunes every type with a non-empty search_space.
The whole config is in the run-fingerprint (changed HPO -> new cache key).
model_config = {'extra': 'forbid', 'frozen': True}
class-attribute
Bases: pydantic.main.BaseModel
Ensembling catalog; opt-in, default OFF via ensemble=None.
When set (and run_mode='full'), composition blends the leaderboard candidates after the
honest selection and ships a BlendedEstimator only if the blend is significantly
better than the best single (the same SignificanceTest gate selection uses); otherwise the
single winner is shipped. method is the weight search: "caruana" (default, greedy with
replacement + seeded bagging) or "weighted" (SLSQP simplex). size caps Caruana steps /
library; n_bags is the bagging count (1 = no bagging). metric=None blends on the run
metric. The whole config is in the run-fingerprint (a changed ensemble config -> a new cache key).
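For readers unfamiliar with the "caruana" method, the core idea (greedy selection with replacement over out-of-fold predictions) can be sketched as follows. This is an illustration of the general technique, not the library's code: it omits bagging and the significance gate, uses MSE as a stand-in metric, and `caruana_weights` is a hypothetical name.

```python
def mse(pred, y):
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)


def caruana_weights(oof_preds, y, size):
    """Greedy forward selection with replacement; returns {model_id: weight}.

    oof_preds maps a model id to its out-of-fold predictions. Lower MSE wins.
    """
    running_sum = [0.0] * len(y)
    picks = []
    for step in range(1, size + 1):
        best_id, best_loss = None, float("inf")
        for model_id in sorted(oof_preds):  # sorted for deterministic ties
            blend = [(s + p) / step for s, p in zip(running_sum, oof_preds[model_id])]
            loss = mse(blend, y)
            if loss < best_loss:
                best_id, best_loss = model_id, loss
        picks.append(best_id)  # the same model may be picked again
        running_sum = [s + p for s, p in zip(running_sum, oof_preds[best_id])]
    return {m: picks.count(m) / size for m in sorted(set(picks))}


y = [1.0, 2.0, 3.0, 4.0]
oof = {
    "high": [1.5, 2.5, 3.5, 4.5],  # biased upward by 0.5
    "low": [0.5, 1.5, 2.5, 3.5],   # biased downward by 0.5
    "noisy": [3.0, 0.0, 5.0, 1.0],
}
weights = caruana_weights(oof, y, size=4)
blend = [sum(w * oof[m][i] for m, w in weights.items()) for i in range(len(y))]
```

The two oppositely biased models cancel each other, so the greedy search settles on equal weights and ignores the noisy one.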
model_config = {'extra': 'forbid', 'frozen': True}
class-attribute
Bases: pydantic.main.BaseModel
Experiment-tracking opt-in; default OFF via tracker=None.
Post-selection observability: NOT part of RunConfig / the run-fingerprint, since
tracking cannot change the model (like finalize).
tracking_uri=None defers to the backend's own resolution (e.g. env
MLFLOW_TRACKING_URI -> file:./mlruns); run_name=None lets the backend
generate a neutral, data-independent name.
model_config = {'extra': 'forbid', 'frozen': True}
class-attribute
Data and selection types
Task, FeatureSchema, ColumnRole and Dataset describe the input data;
SelectionPolicy, Candidate and select_best implement final-model selection.
Bases: pydantic.main.BaseModel
Problem definition: kind + target metric name + split/typing policy.
default_cv_scheme
property
Default cross-validation scheme when the user does not override it.
model_config = {'extra': 'forbid', 'frozen': True}
class-attribute
target_metric
property
The declared target metric name, or the default for this kind.
Bases: pydantic.main.BaseModel
Typed column contract: roles + schema-owned category tables + NaN policy.
Serializable, so the same schema (including fitted category tables and FE specs) is reused at
inference. Built/validated by the Reader at the data boundary. The FE specs
(datetime_spec/target_encoding/frequency_encoding/intersections) are additive and
default None so an older artifact loads unchanged.
categorical
property
CATEGORICAL features: original_categorical ⊕ intersections.
features
property
Model-facing features in the pinned FE block order.
numeric ⊕ categorical where each block is itself FE-block-ordered, so this equals
original_numeric ⊕ datetime ⊕ frequency ⊕ target_encoding ⊕ original_categorical ⊕
intersections. design_matrix materializes the numeric block then the categorical codes,
so column j of the model input is exactly features[j]. Without FE this is the
unchanged numeric + categorical.
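The pinned block order can be made concrete with a small sketch. `feature_order` is a hypothetical helper and every column name is invented for illustration; only the block sequence itself comes from the description above.

```python
def feature_order(original_numeric, datetime_deltas, frequency, target_encoding,
                  original_categorical, intersections):
    """Concatenate the FE blocks in the pinned order: numeric block, then categorical."""
    numeric = [*original_numeric, *datetime_deltas, *frequency, *target_encoding]
    categorical = [*original_categorical, *intersections]
    return numeric + categorical


# All column names below are made up for this example.
features = feature_order(
    original_numeric=["age", "income"],
    datetime_deltas=["signup_delta"],
    frequency=["city_freq"],
    target_encoding=["city_te"],
    original_categorical=["city", "plan"],
    intersections=["city__plan"],
)

# Column j of the model input corresponds to features[j].
column_of = {name: j for j, name in enumerate(features)}
```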
model_config = {'extra': 'forbid'}
class-attribute
numeric
property
NUMERIC features: original_numeric ⊕ datetime ⊕ frequency ⊕ target_encoding.
Block order is derived from the FE specs, not the roles-dict insertion order, so it is deterministic and identical train==inference. Without FE this is the plain role view, unchanged.
time
property
The TIME-role column (CV time axis), distinct from DATETIME features.
categorical_indices(cap=None)
Positions of natively-routed CATEGORICAL columns in the post-FS-projection design matrix.
Projects features to selected_features (in schema.features order, matching
design_matrix) then takes the positions of the cardinality-gated categorical names
(native_routable); includes intersections (a__b) subject to the same gate and
excludes the FE numeric outputs (_te/_freq/datetime). cap=None keeps every
categorical (ungated opt-out, ADR-0092/0094). Empty when the (possibly projected/gated) set
carries no native categoricals — a legitimate native no-op.
with_categories(tables)
Return a copy of the schema with the fitted category tables attached.
with_datetime_spec(spec)
Return a copy with the fitted datetime-delta spec attached.
with_frequency_encoding(spec)
Return a copy with the fitted frequency-encoding spec attached.
with_intersections(spec)
Return a copy with the intersection spec attached; pair tables go in categories.
with_selected_features(names)
Return a copy carrying the selected feature subset; design_matrix projects to it.
with_target_encoding(spec)
Return a copy with the fitted full-train target-encoding spec attached.
Bases: builtins.str, enum.Enum
Role a column plays. The core never hard-codes domain column names.
CATEGORICAL = <ColumnRole.CATEGORICAL: 'categorical'>
class-attribute
DATETIME = <ColumnRole.DATETIME: 'datetime'>
class-attribute
FOLD = <ColumnRole.FOLD: 'fold'>
class-attribute
GROUP = <ColumnRole.GROUP: 'group'>
class-attribute
IGNORE = <ColumnRole.IGNORE: 'ignore'>
class-attribute
NUMERIC = <ColumnRole.NUMERIC: 'numeric'>
class-attribute
TARGET = <ColumnRole.TARGET: 'target'>
class-attribute
TEXT = <ColumnRole.TEXT: 'text'>
class-attribute
TIME = <ColumnRole.TIME: 'time'>
class-attribute
Bases: typing.Protocol
Domain view over tabular data: numeric block, categorical codes, target.
categorical_codes()
Categorical feature codes as int64 with shape (n_rows, n_categorical).
Codes come from the schema-owned category table, so they are identical on train and inference.
groups()
Group-column values in row order, or None when there is no group role.
The single source of group labels for group-aware CV: the splitter
and validate_fold both read this array, index-aligned with design_matrix,
so the group/fold/feature ordering cannot drift.
label_time()
Optional per-row label-end-time t1 for full de Prado purge, or None.
Name-based secondary metadata (like sample_weight), present only when declared; used by
the splitter to drop train rows whose label window overlaps the test interval.
sample_weight()
Per-row sample weights, or None.
select(columns)
Return a dataset restricted to columns (schema updated accordingly).
take(indices)
Return a dataset with only the given row indices (fold slicing).
target()
Target values, or None for an inference dataset.
time()
TIME-role column values in row order, or None.
The single, index-aligned source of the CV time axis for TimeSeriesSplitter and the
value-based validate_fold (same contract as groups()), so the splitter never reads
a reserved column name from the frame. Distinct from DATETIME features.
to_numpy()
Numeric feature block as float64 with shape (n_rows, n_numeric).
with_selected_features(names)
Return a dataset whose schema carries the feature-selection subset.
Same rows/frame; only schema.selected_features is set, so design_matrix projects the
model input to names on refit and inference (train==inference by construction).
Bases: pydantic.main.BaseModel
Selection rule: absolute primary metric + inert lexicographic tie-break.
model_config = {'extra': 'forbid', 'frozen': True}
class-attribute
A leaderboard entry: its absolute score plus secondary attributes and OOF predictions.
oof_pred is the metric-ready out-of-fold vector the band aligns on:
P(positive)/(n, K) proba for proba-metrics, else the predicted class/value. oof_mask
marks which rows actually have an OOF prediction (holdout yields a partial OOF; degenerate
folds are skipped), so validity is tracked by the mask, never np.isnan —
which would crash on int/str class vectors.
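Why a mask instead of NaN markers: with string (or integer) class labels there is no NaN to test for, so validity has to be carried separately. A minimal sketch, with `masked_accuracy` as a hypothetical helper and made-up data:

```python
def masked_accuracy(y_true, oof_pred, oof_mask):
    """Accuracy over the rows that actually have an OOF prediction."""
    pairs = [(t, p) for t, p, m in zip(y_true, oof_pred, oof_mask) if m]
    if not pairs:
        raise ValueError("no rows have an OOF prediction")
    return sum(t == p for t, p in pairs) / len(pairs)


y_true = ["cat", "dog", "dog", "cat", "bird"]
# The last two rows were never predicted (e.g. a holdout covers only part of
# the data); their placeholder values are meaningless and must be ignored.
oof_pred = ["cat", "dog", "cat", "", ""]
oof_mask = [True, True, True, False, False]

score = masked_accuracy(y_true, oof_pred, oof_mask)
n_scored = sum(oof_mask)
```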
n_features = 0
class-attribute
stability = 0.0
class-attribute
train_time = 0.0
class-attribute
Runtime utilities
honestml.__version__ — the installed package version.
Mutable state of a single run: timings, logger, config.
manifest()
Serializable run manifest (config + timings) — basis for replay.
record_stage_time(key, stage, elapsed)
Record a stage time loaded from cache (no timer).
timed_stage(key, stage)
Time a stage and record the elapsed seconds under timings[key][stage].
total_time(key)
Sum of all stage times recorded for key.
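The timing methods above suggest a context-manager pattern that stores elapsed seconds under timings[key][stage]. The sketch below shows one way this could work; `RunTimings` is a hypothetical stand-in, and recording the time even when the stage raises is an assumption of this example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class RunTimings:
    """Hypothetical stand-in that mirrors the method names listed above."""

    def __init__(self):
        self.timings = defaultdict(dict)

    @contextmanager
    def timed_stage(self, key, stage):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Assumption: the elapsed time is recorded even if the stage fails.
            self.timings[key][stage] = time.perf_counter() - start

    def record_stage_time(self, key, stage, elapsed):
        """Store a time that was measured earlier (e.g. loaded from a cache)."""
        self.timings[key][stage] = float(elapsed)

    def total_time(self, key):
        return sum(self.timings[key].values())


run = RunTimings()
with run.timed_stage("model_a", "fit"):
    sum(range(10_000))  # placeholder workload
run.record_stage_time("model_a", "predict", 0.25)
total = run.total_time("model_a")
```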
Exceptions
All errors derive from honestml.AutoMLError:
Bases: honestml.core.exceptions.AutoMLError
Invalid configuration (wraps validation failures at the boundary).
Bases: honestml.core.exceptions.AutoMLError
Input does not satisfy the FeatureSchema/Task contract.
Covers X/y length mismatch, unknown/missing columns, targets outside
Task.kind, empty or all-NaN inputs, and dtype drift.
Also the artifact/serialization format boundary: an unknown
model_type/model_format and a non-exportable estimator are the same
kind of contract violation, not a new exception type.
Bases: honestml.core.exceptions.AutoMLError
An optional extra is required but not installed.
Raised by adapters (not by core imports) so a missing boosting/tracking
library surfaces as an actionable message instead of an ImportError deep
in an import chain.
Bases: honestml.core.exceptions.AutoMLError
An artifact failed integrity verification before deserialization.
reason is one of missing_checksums (no checksums block under
require_integrity), missing_file (a checksummed file is absent or its name
escapes the artifact directory), digest_mismatch (a file's sha256 differs —
corruption or naive tampering) or signature_mismatch (the optional signature
hook rejected the artifact). Integrity detects corruption/naive substitution, NOT
authenticity: a malicious author can embed code with a matching digest — use a
signature (and load only from a trusted source) for that.
Bases: honestml.core.exceptions.AutoMLError
A fit-dependent operation was used before fit (e.g. predict on a fresh model).
Bases: honestml.core.exceptions.AutoMLError
The run budget was exhausted before any candidate completed.
Raised only when the budget skipped candidates and none finished — distinct from
FitFailedError (every candidate that started failed on its own). Carries the
budget mode and the completed/skipped/failed counts for an actionable message.
Bases: honestml.core.exceptions.AutoMLError
A feature-selection strategy failed during compare (fail-fast).
Raised when any strategy in FeatureSelectionConfig.compare raises while selecting its subset:
the offending strategy name is reported and the original error chained, instead of silently
dropping a strategy from the comparison (no silent defaults).
fit may also raise more specific subclasses — notably FitFailedError (importable
from honestml.core) when every candidate fails; catch AutoMLError to cover them all.