Data input
What honestml accepts at fit, what it infers from your data and how it treats
missing values. Every python block on this page is self-contained: copy any
one of them and it runs as-is — and every block is executed on each CI run, so
the examples cannot rot.
Accepted inputs
fit(X, y) accepts a pandas DataFrame, a polars DataFrame or a 2-D numpy array;
y is any row-aligned array-like. A single boundary reader validates the input —
an empty frame, a length mismatch or an unsupported type raises
SchemaValidationError with a specific reason, never a bare ValueError deep
inside training. DataFrames keep their column names and may freely mix numeric,
string and datetime columns; a numpy array gets synthetic names f0..f{n-1} and
is typed per column like any other input.
import numpy as np
import pandas as pd
import polars as pl
from honestml import AutoML
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame(
    {
        "age": rng.normal(40.0, 10.0, n),
        "plan": rng.choice(["free", "pro", "team"], n),
        "usage": rng.exponential(1.0, n),
    }
)
y = ((df["usage"] > 1.0) | (df["plan"] == "pro")).astype(int).to_numpy()
from_pandas = AutoML(task="binary", models=("baseline", "linear"), random_state=0).fit(df, y)
from_polars = AutoML(task="binary", models=("baseline", "linear"), random_state=0).fit(
    pl.DataFrame(df.to_dict(orient="list")), y
)
from_numpy = AutoML(task="binary", models=("baseline", "linear"), random_state=0).fit(
    df[["age", "usage"]].to_numpy(), y
)
print(from_pandas.best_model_id_, from_polars.best_model_id_, from_numpy.best_model_id_)
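The boundary check described above can be pictured with a small standalone sketch. It is not honestml's reader: `BoundaryError` and `read_boundary` are hypothetical names invented for illustration, and polars handling is left out to keep the example dependency-light. It only shows the idea of validating once, up front, and naming numpy columns f0..f{n-1}.

```python
import numpy as np
import pandas as pd


class BoundaryError(ValueError):
    """Hypothetical stand-in for a boundary validation error (not honestml's class)."""


def read_boundary(X, y):
    """Validate (X, y) once, up front; return a named DataFrame and a target array."""
    if isinstance(X, pd.DataFrame):
        frame = X
    elif isinstance(X, np.ndarray) and X.ndim == 2:
        # A bare array has no column names, so synthesize f0..f{n-1}.
        frame = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
    else:
        raise BoundaryError(
            f"unsupported input: expected a DataFrame or a 2-D array, got {type(X).__name__}"
        )
    if frame.shape[0] == 0 or frame.shape[1] == 0:
        raise BoundaryError("empty input: X has no rows or no columns")
    target = np.asarray(y)
    if target.shape[0] != frame.shape[0]:
        raise BoundaryError(
            f"length mismatch: X has {frame.shape[0]} rows, y has {target.shape[0]}"
        )
    return frame, target


frame, target = read_boundary(np.zeros((4, 3)), [0, 1, 0, 1])
print(list(frame.columns))

try:
    read_boundary(np.zeros((4, 3)), [0, 1])
except BoundaryError as err:
    print(err)  # the reason is specific, and it is raised before any training
```

Raising at the boundary is the design point: the caller sees which input was wrong instead of a generic error from inside a model.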
What schema inference does
Each column gets a role from its dtype: strings become categorical features,
dates become datetime, floats become numeric. Integer columns are inspected, not
trusted: a low-cardinality integer (≤ 20 distinct values by default) is treated
as categorical, and a nearly-all-unique integer is dropped as an id-like column —
the thresholds are Task fields (numeric_cat_max_unique, numeric_id_rate,
numeric_id_min_unique). Every categorical column gets a category table fitted
on the training data and frozen into the schema: known categories map to stable
integer codes, nulls to a reserved code, and a value unseen at fit maps to a
reserved unknown code at predict — never to a wrong known category — so the
train↔inference encoding cannot drift. The two boosting models that support it,
CatBoost and LightGBM, consume these columns as native categorical features
(CatBoost via ordered target statistics, LightGBM via its categorical_feature
splits); linear and baseline treat the same codes as ordinal integers. The
fitted schema ships inside the model artifact, and an inference batch where over
10% of a column's values were unseen at train triggers a drift warning.
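The integer rule above can be sketched in plain pandas. This is an illustration, not honestml's implementation: `infer_role` is a hypothetical helper, the limit of 20 comes from the text, the two id thresholds are made-up values, and the order in which the checks run is an assumption.

```python
import pandas as pd

CAT_MAX_UNIQUE = 20  # from the text above
ID_RATE = 0.95       # assumed value, for illustration only
ID_MIN_UNIQUE = 50   # assumed value, for illustration only


def infer_role(col: pd.Series) -> str:
    """Return 'datetime', 'categorical', 'numeric' or 'id' for one column."""
    if pd.api.types.is_datetime64_any_dtype(col):
        return "datetime"
    if pd.api.types.is_integer_dtype(col):
        n_unique = col.nunique()
        if n_unique <= CAT_MAX_UNIQUE:
            return "categorical"  # few distinct values: treat as categories
        if n_unique >= ID_MIN_UNIQUE and n_unique / len(col) >= ID_RATE:
            return "id"  # nearly all unique: id-like, would be dropped
        return "numeric"
    if pd.api.types.is_float_dtype(col):
        return "numeric"
    return "categorical"  # strings and anything else


n = 100
frame = pd.DataFrame(
    {
        "income": [i * 0.5 for i in range(n)],       # float
        "city": ["riga", "tallinn"] * (n // 2),      # string
        "rooms": [1 + i % 4 for i in range(n)],      # 4 distinct integers
        "row_id": list(range(n)),                    # every value unique
        "visits": [i % 30 for i in range(n)],        # 30 distinct integers, repeated
        "signup": pd.date_range("2024-01-01", periods=n, freq="D"),
    }
)
roles = {name: infer_role(frame[name]) for name in frame.columns}
print(roles)
```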
import numpy as np
import pandas as pd
from honestml import AutoML
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(
    {
        "income": rng.normal(0.0, 1.0, n),
        "city": rng.choice(["riga", "tallinn", "vilnius"], n),
        "rooms": rng.integers(1, 5, n),  # low-cardinality int -> categorical
    }
)
y = (X["income"] + (X["city"] == "riga") > 0.5).astype(int)
model = AutoML(task="binary", models=("baseline", "linear"), random_state=0).fit(X, y)
print({col: model.schema_.roles[col].value for col in X.columns})
print(model.schema_.categories["city"].categories)
X_new = X.head(5).assign(city="warsaw") # a category unseen at fit
print(model.predict(X_new).shape)
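To make the frozen category table concrete, here is a minimal sketch of the idea in plain pandas. The code numbering (0 for nulls, 1 for unknown, known categories from 2) and the helper names are assumptions for illustration, as is measuring the unseen rate over all rows of the batch; honestml's internal layout may differ.

```python
import pandas as pd

NULL_CODE = 0      # reserved for missing values (assumed numbering)
UNKNOWN_CODE = 1   # reserved for values unseen at fit (assumed numbering)
FIRST_KNOWN = 2
DRIFT_RATE = 0.10  # the 10% threshold mentioned in the text


def fit_table(train: pd.Series) -> dict:
    """Freeze a value -> code mapping from the training column."""
    known = sorted(train.dropna().unique())
    return {value: FIRST_KNOWN + i for i, value in enumerate(known)}


def encode(col: pd.Series, table: dict) -> pd.Series:
    """Encode with the frozen table; never reuse a known code for a new value."""
    codes = col.map(table).fillna(UNKNOWN_CODE)  # anything not in the table
    codes = codes.mask(col.isna(), NULL_CODE)    # real nulls keep their own code
    return codes.astype(int)


train = pd.Series(["riga", "tallinn", "vilnius", "riga", None])
table = fit_table(train)

batch = pd.Series(["riga", "warsaw", None, "vilnius"])
codes = encode(batch, table)
rate = float((codes == UNKNOWN_CODE).mean())
drifted = rate > DRIFT_RATE

print(table)
print(codes.tolist(), rate, drifted)
```

Because the table is fitted once and only read afterwards, the same value always gets the same code at train and at inference, and a new value such as "warsaw" can only land on the unknown code.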
Declaring the task
task accepts a string — "binary", "multiclass" or "regression" — or a
Task object; the string is sugar for Task(kind=...). Task adds
positive_label (which class counts as positive for binary scoring; by default
label 1 when present, else the greatest label) and the auto-typing thresholds
above. The selection metric defaults per kind — roc_auc (binary), log_loss
(multiclass), rmse (regression) — and metric= overrides it by name:
roc_auc, pr_auc, accuracy, log_loss, brier, ece, rmse, mae. An
incompatible pair (a probability metric on a regression task, pr_auc on
multiclass) fails fast with ConfigError before any training runs.
import numpy as np
from sklearn.datasets import make_classification
from honestml import AutoML, Task
X, y01 = make_classification(n_samples=200, n_features=8, n_informative=5, random_state=0)
y = np.where(y01 == 1, "churn", "stay") # string labels work as-is
model = AutoML(
    task=Task(kind="binary", positive_label="churn"),
    metric="log_loss",
    models=("baseline", "linear"),
    random_state=0,
).fit(X, y)
print(model.classes_, model.run_report_["metric"])
print(model.predict_proba(X[:3]).shape)
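The two defaults described above are simple enough to write down. The sketch below restates them in plain Python; `DEFAULT_METRIC` and `default_positive_label` are hypothetical names, and "greatest label" is assumed to mean ordinary ordering (lexicographic for strings).

```python
DEFAULT_METRIC = {"binary": "roc_auc", "multiclass": "log_loss", "regression": "rmse"}


def default_positive_label(labels):
    """Label 1 when present, else the greatest label."""
    distinct = set(labels)
    return 1 if 1 in distinct else max(distinct)


print(DEFAULT_METRIC["binary"])
print(default_positive_label([0, 1, 1, 0]))
print(default_positive_label(["churn", "stay", "stay"]))
```

Under that assumption the string labels "churn" and "stay" would default to "stay" as the positive class, which is why the example above passes positive_label="churn" explicitly.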
Row-aligned metadata
sample_weight= is passed to fit next to y: one weight per row, never a
feature. Weights flow through the whole honest pipeline — each fold's training,
the out-of-fold scoring that ranks candidates, calibration and the final refit —
so the leaderboard and the shipped model agree on what a row is worth. groups=
(group-aware CV) and time=/label_time= (time-series CV) are the same kind of
row-aligned metadata and are covered on the
cross-validation page.
import numpy as np
from sklearn.datasets import make_classification
from honestml import AutoML
X, y = make_classification(n_samples=200, n_features=8, n_informative=5, random_state=0)
weights = np.where(y == 1, 2.0, 1.0) # the positive class counts double
model = AutoML(task="binary", models=("baseline", "linear"), random_state=0).fit(
    X, y, sample_weight=weights
)
print(model.best_model_id_, round(model.score(X, y, sample_weight=weights), 3))
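What "a row is worth" means for scoring can be checked with numpy alone. The sketch below is not honestml code: `weighted_log_loss` is a hypothetical helper that shows a weighted metric giving the same number as an unweighted metric on data where each positive row appears twice.

```python
import numpy as np


def weighted_log_loss(y_true, p_pos, sample_weight=None):
    """Binary log loss as a (weighted) mean of per-row losses."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pos, dtype=float), 1e-15, 1 - 1e-15)
    per_row = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return float(np.average(per_row, weights=sample_weight))


y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.4])       # predicted probability of class 1
w = np.where(y == 1, 2.0, 1.0)           # the positive class counts double

weighted = weighted_log_loss(y, p, sample_weight=w)

# Same data with every positive row physically repeated, scored without weights.
dup = np.repeat(np.arange(len(y)), w.astype(int))
duplicated = weighted_log_loss(y[dup], p[dup])

print(np.isclose(weighted, duplicated))
```

If the weights reached training but not scoring, the leaderboard would rank candidates by the unweighted number while the model was fitted for the weighted one, which is the disagreement the text above rules out.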
Missing values
honestml does not impute your data behind your back — it never rewrites the frame
you pass to fit, and categorical nulls get their own reserved code. Numeric NaN
reaches each model's boundary as-is, and every built-in then handles it on its own
terms: the boosting models (catboost, lightgbm, xgboost) split on NaN
natively, while linear and baseline carry a per-fold median imputer baked
inside their pipeline — fit fold by fold so it never leaks, and shipped with the
model so inference imputes the same way. No built-in is dropped for carrying NaN;
the skip-with-WARNING gate fires only for a third-party plugin that declares it
cannot handle missing input. The example below therefore runs without the
boosting extra: linear and baseline alone tolerate NaN.
import numpy as np
from sklearn.datasets import make_classification
from honestml import AutoML
X, y = make_classification(n_samples=150, n_features=6, n_informative=4, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan # 10% missing values
model = AutoML(task="binary", cv=3, random_state=0).fit(X, y) # models=None: every installed model
# Every installed model ranks: linear/baseline impute, boosting splits on NaN.
print(sorted(e.model_id for e in model.leaderboard_))
print(model.best_model_id_)
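The phrase "fit fold by fold so it never leaks" can be illustrated with numpy alone. This is a sketch of the general technique, not honestml's pipeline: `impute_fold` is a hypothetical helper, and the tiny array and the unshuffled four-way split are chosen only to keep the example deterministic.

```python
import numpy as np


def impute_fold(X_train, X_valid):
    """Learn column medians on the training rows only, then fill both parts."""
    medians = np.nanmedian(X_train, axis=0)
    filled_train = np.where(np.isnan(X_train), medians, X_train)
    filled_valid = np.where(np.isnan(X_valid), medians, X_valid)
    return filled_train, filled_valid, medians


X = np.arange(24, dtype=float).reshape(8, 3)
X[[0, 6], 0] = np.nan
X[3, 1] = np.nan
X[5, 2] = np.nan

folds = np.array_split(np.arange(len(X)), 4)
fold_medians = []
for k, valid_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    X_tr, X_va, medians = impute_fold(X[train_idx], X[valid_idx])
    fold_medians.append(medians)
    # The held-out rows are filled with statistics they did not contribute to.
    print(k, medians, bool(np.isnan(X_va).any()))
```

The medians differ from fold to fold because each one sees different training rows; a single median computed on the full array before splitting would let held-out rows influence their own imputation.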
Everything inferred here is observable after fit: model.schema_ carries the
roles and the frozen category tables, and the same schema ships inside the saved
artifact — see the quickstart for saving and serving it.