Python API¶
The Python surface is centered on:
- Table
- train(...)
- Model and OptimizedModel
- sklearn-compatible wrappers in forestfire.tree, forestfire.forest, and forestfire.gbm
Training¶
train(
x,
y=None,
algorithm="dt",
task="auto",
tree_type="cart",
split_strategy="axis_aligned",
builder="greedy",
lookahead_depth=1,
lookahead_top_k=8,
lookahead_weight=1.0,
beam_width=4,
criterion="auto",
canaries=2,
bins="auto",
histogram_bins=None,
physical_cores=None,
max_depth=None,
min_samples_split=None,
min_samples_leaf=None,
n_trees=None,
max_features=None,
seed=None,
compute_oob=False,
learning_rate=None,
bootstrap=False,
top_gradient_fraction=None,
other_gradient_fraction=None,
missing_value_strategy=None,
categorical_strategy=None,
categorical_features=None,
target_smoothing=20.0,
filter=None,
sample_weight=None,
)
Supported values¶
- algorithm="dt" | "rf" | "gbm"
- task="auto" | "regression" | "classification"
- tree_type="id3" | "c45" | "cart" | "randomized" | "oblivious"
- split_strategy="axis_aligned" | "oblique"
- builder="greedy" | "lookahead" | "beam" | "optimal"
- criterion="auto" | "gini" | "entropy" | "mean" | "median"
Parameter semantics¶
algorithm¶
- dt: one tree
- rf: bagged ensemble with bootstrap sampling and feature subsampling
- gbm: stage-wise second-order boosting with shrinkage and gradient-focused row sampling
task¶
task="auto" infers:
- classification for integer, boolean, and string targets
- regression for float targets
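The inference rule can be sketched in dtype terms. This is an illustrative helper (the function name and dtype-kind dispatch are assumptions, not ForestFire's actual implementation):

```python
import numpy as np

def infer_task(y):
    # Sketch of the documented task="auto" rule: integer, boolean, and
    # string targets -> classification; float targets -> regression.
    y = np.asarray(y)
    if y.dtype.kind in ("i", "u", "b", "U", "S"):
        return "classification"
    if y.dtype.kind == "f":
        return "regression"
    raise TypeError(f"cannot infer task for dtype {y.dtype}")
```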
tree_type¶
- id3: entropy-first classifier
- c45: practical extension of ID3
- cart: standard binary tree
- randomized: stochastic split-search variant
- oblivious: symmetric tree with one split per depth
split_strategy¶
- axis_aligned: ordinary one-feature threshold splits
- oblique: two-feature linear splits of the form w1 * x_i + w2 * x_j <= t
Current support matrix:
- axis_aligned: supported everywhere
- oblique: supported for dt, rf, and gbm when tree_type is cart or randomized
Current oblique behavior:
- all candidate feature pairs available at the node are considered
- the learned split is still sparse and pairwise: exactly two features per node
- missing values are routed independently per participating feature rather than forcing a single node-level missing fallback
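As an illustration, routing rows through the two-feature split form is a one-liner (a sketch only; the library learns the pair, weights, and threshold during split search, and the helper name is invented):

```python
import numpy as np

def oblique_goes_left(X, i, j, w1, w2, t):
    # Rows satisfying w1 * x_i + w2 * x_j <= t go to the left branch.
    X = np.asarray(X, dtype=float)
    return w1 * X[:, i] + w2 * X[:, j] <= t
```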
criterion¶
Current auto behavior:
- id3, c45 classification -> entropy
- cart, randomized, oblivious classification -> gini
- regression models -> mean
- gbm trains second-order trees internally when criterion="auto"
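The mapping for first-order trees can be summarized as a small lookup (an illustrative sketch of the documented defaults, not a library function; gbm bypasses it with its own second-order objective):

```python
def auto_criterion(tree_type, task):
    # Documented criterion="auto" defaults for first-order trees.
    if task == "regression":
        return "mean"
    if tree_type in ("id3", "c45"):
        return "entropy"
    return "gini"  # cart, randomized, oblivious classification
```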
builder¶
- greedy: ordinary immediate-gain split ranking
- lookahead: bounded recursive subtree rescoring over the top immediate candidates
- beam: the same bounded recursive rescoring, but with width-limited continuation retention during future search
- optimal: exhaustively score the full downstream subtree for every legal split until a real stopping condition is reached
Builder controls tree construction strategy independently of:
- algorithm
- tree_type
- split_strategy
Related parameters:
- lookahead_depth
- lookahead_top_k
- lookahead_weight
- beam_width
For optimal, those four tuning knobs are ignored. Search size is controlled
instead by the ordinary tree limits and by canary filtering.
canaries¶
Canaries are shuffled copies of already-preprocessed features used for automatic growth stopping.
Current stopping behavior:
- standard trees stop at the current node when no acceptable real split survives canary competition
- oblivious trees stop the remaining depth growth when no acceptable real level-split survives canary competition
- gbm stops adding new stages when no acceptable real root split survives canary competition
Canaries are active for dt and gbm.
Random forests are the exception:
- rf deliberately ignores canaries during tree training
- so canaries and filter do not affect random-forest growth policy
filter¶
filter controls how strict canary competition is once split candidates have been scored and ranked.
Accepted forms:
- None
- positive integer
- float in [0, 1)
The ranking rule is:
- score split candidates as usual
- sort them from best to worst
- look only inside the allowed top window
- choose the best real feature inside that window
- if the window contains only canaries, stop growth under the usual canary rule
filter=None is the default strict policy and is equivalent to filter=1:
- only the single best-ranked candidate is eligible
- if that candidate is a canary, the node stops
If filter is an integer n:
- the chosen real feature must appear within the top n scored candidates
- canaries are still allowed to occupy earlier ranks
- the selected split is the highest-ranked real split inside that top-n window
Example:
filter=3 means “after sorting all candidates, ignore canaries if needed, but only within the top 3 ranked candidates”
If filter is a float alpha in [0, 1):
- ForestFire converts it into a top-window fraction of 1 - alpha
- if there are k scored candidates, the allowed window size is ceil((1 - alpha) * k)
- the chosen split is again the highest-ranked real split inside that window
Example:
filter=0.95 means “look only at the top 5% of ranked candidates”
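All three accepted forms reduce to one windowing rule over the ranked candidates. A minimal sketch of that rule (illustrative helpers under the rules above, not the library's internals):

```python
import math

def window_size(filter_value, k):
    # Allowed top-window size over k scored candidates.
    if filter_value is None:
        return 1                     # strict default: top-1 only
    if isinstance(filter_value, int):
        return min(filter_value, k)  # integer n: top-n window
    # float alpha in [0, 1): window fraction is 1 - alpha
    return min(k, math.ceil((1.0 - filter_value) * k))

def choose_split(ranked, filter_value=None):
    # ranked: best-to-worst list of (candidate, is_canary) pairs.
    # Returns the best real candidate in the window, or None to stop growth.
    window = ranked[: window_size(filter_value, len(ranked))]
    for candidate, is_canary in window:
        if not is_canary:
            return candidate
    return None  # only canaries in the window -> stop under the canary rule
```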
A few practical details matter:
- the window is computed over scored split candidates, not just over raw input columns
- the exact candidate count can vary by algorithm and by node, because max_features, tree type, and node-local feasibility all affect how many candidates are actually scorable
- for oblivious trees, the competition happens at the next shared level-split
- for gbm, the same logic is applied at the root of the next stage; if no real split survives inside the allowed window, that whole stage is discarded and boosting stops
- gbm also accepts boosting_first_stage_retry_filter, which uses the same value shape as filter; if omitted, its effective default is still top-1, so the retry does not relax anything unless you widen it explicitly
- pass a larger integer, such as 2, or a float like 0.95 to widen the first-stage retry window; pass False to disable the retry entirely
bins¶
Current values:
- "auto"
- integer in 1..=512
Current auto behavior:
- per numeric feature, ForestFire picks the highest power of two up to 512
- each realized bin must contain at least two rows
- the chosen count is capped by the number of distinct observed values
bins applies when ForestFire is preprocessing raw training data into a
Table. It controls the stored numeric representation of that table.
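The auto rule can be approximated as follows. This is a hedged sketch: the real selection operates on rank-binned data, so treat the `n_rows // 2` guard as one way to express "at least two rows per realized bin" rather than the exact library computation:

```python
def auto_bin_count(n_rows, n_distinct, cap=512):
    # Highest power of two up to `cap`, keeping at least two rows per
    # realized bin and never exceeding the distinct observed values.
    limit = max(1, min(cap, n_rows // 2, n_distinct))
    b = 1
    while b * 2 <= limit:
        b *= 2
    return b
```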
histogram_bins¶
Current values:
- None
- "auto"
- integer in 1..=128
Semantics:
- None: reuse the numeric bins already present in the input training table
- "auto" or an integer: rebuild the numeric training view at that resolution before split search
This is the estimator-facing control for histogram width. It is separate from
bins:
- bins controls how a raw Python input is preprocessed into a training table
- histogram_bins controls the numeric resolution used by split-search histograms
That distinction matters when:
- you pass an already-built Table to train(...)
- you want one stored table representation but a different histogram resolution during fitting
physical_cores¶
This controls CPU usage during fitting. ForestFire uses physical cores as the public knob because split scoring is memory-sensitive and that is a better limit than logical threads for this workload.
compute_oob¶
Only meaningful for algorithm="rf".
- exposes model.compute_oob
- exposes model.oob_score
- classification uses OOB accuracy
- regression uses OOB R^2
learning_rate¶
Only meaningful for algorithm="gbm".
- each stage prediction is multiplied by learning_rate
- lower values generally require more trees
bootstrap¶
Only meaningful for algorithm="gbm".
- False: each stage starts from the full table, then applies gradient-focused row sampling
- True: each stage first draws a bootstrap sample, then applies gradient-focused row sampling
top_gradient_fraction and other_gradient_fraction¶
Only meaningful for algorithm="gbm".
- top_gradient_fraction keeps the largest-gradient rows
- other_gradient_fraction samples additional rows from the remainder
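The interaction of the two fractions can be sketched as a GOSS-style sampler. This is an illustration under assumptions (the function name and ceil-based rounding are invented), not ForestFire's actual kernel:

```python
import numpy as np

def gradient_focused_rows(grad, top_fraction, other_fraction, rng):
    # Keep the rows with the largest |gradient|, then sample extra rows
    # uniformly from the remainder.
    grad = np.asarray(grad, dtype=float)
    order = np.argsort(-np.abs(grad))            # largest gradients first
    n_top = int(np.ceil(top_fraction * len(grad)))
    top, rest = order[:n_top], order[n_top:]
    n_other = int(np.ceil(other_fraction * len(rest)))
    other = rng.choice(rest, size=n_other, replace=False)
    return np.concatenate([top, other])
```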
missing_value_strategy¶
Controls how split search handles features with missing values.
Accepted forms:
- "heuristic"
- "optimal"
- {"col_1": "heuristic", "col_2": "optimal", ...}
- {"f0": "heuristic", "f1": "optimal", ...}
Semantics:
- "heuristic": choose the best split using only observed values first, then evaluate whether the missing rows should go left or right for that chosen split
- "optimal": evaluate missing-left vs missing-right while scoring every candidate split, then choose the overall best combination of split plus missing routing
- dictionary form: apply the chosen strategy per feature, defaulting unspecified features to "heuristic"
The dictionary keys use semantic feature indices:
- "col_1" means feature index 0
- "col_2" means feature index 1
- "f0" means feature index 0
- "f1" means feature index 1
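Because the two key styles use different index bases, the mapping is easy to get wrong. A sketch (illustrative helper, not a library function):

```python
def semantic_feature_index(key):
    # "col_N" keys are 1-based, "fN" keys are 0-based, per the rules above.
    if key.startswith("col_"):
        return int(key[len("col_"):]) - 1
    if key.startswith("f"):
        return int(key[1:])
    raise ValueError(f"unrecognized feature key: {key!r}")
```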
Tradeoff:
- "heuristic" is much faster and is the default
- "optimal" can be substantially slower because it expands the split search
Current implementation note:
- the strategy setting is implemented for the standard first-order tree training paths
- the second-order boosting path uses the same learned missing-routing semantics, but it does not expose a separate heuristic-vs-optimal toggle
sample_weight¶
Per-row training weights. Accepted forms:
- None (default): all rows have equal weight 1.0
- 1-D array-like of length n_rows: explicit per-row weights
Semantics:
- weights scale how much each row contributes to split scoring and leaf values
- weighted MSE is used for regression:
Σ w_i (y_i - ȳ)² - for gradient boosting, weights scale the per-row gradient and Hessian
- weights propagate through bootstrap and gradient-focus sampling in
rfandgbm
Supported for dt, rf, and gbm across all supported tree types and tasks.
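The weighted regression impurity can be written out directly. In this sketch ȳ is taken to be the weighted mean, which makes duplicating a row equivalent to doubling its weight (illustrative code, not the library kernel):

```python
import numpy as np

def weighted_mse(y, w):
    # Sum of w_i * (y_i - ybar)^2 with ybar the weighted mean.
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    ybar = np.average(y, weights=w)
    return float(np.sum(w * (y - ybar) ** 2))
```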
Multi-target regression¶
When y is a 2-D array of shape (n_rows, n_targets) and task="regression",
ForestFire trains a single tree that predicts all targets jointly.
import numpy as np
import forestfire
X = np.random.randn(200, 4)
y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] - X[:, 3]]) # 2 targets
model = forestfire.train(X, y, task="regression")
preds = model.predict(X) # shape (200, 2)
Behavior:
- predict(...) returns a 2-D NumPy array of shape (n_rows, n_targets)
- split scoring maximizes the sum of MSE gain across all targets at each candidate threshold, evaluated at the same threshold for all targets so score and applied split are consistent
- each leaf stores one predicted value per target
- sample_weight is compatible with multi-target regression
Supported for algorithm="dt" and task="regression". Multi-target is not
currently supported for rf or gbm.
Supported input types¶
- NumPy arrays
- Python sequences
- pandas
- polars
- pyarrow
- SciPy dense matrices
- SciPy sparse matrices
Single-row prediction also accepts 1D inputs like:
- [1, 2, 3]
- np.array([1, 2, 3])
The key API distinction is:
- training can use raw inputs or an explicit Table
- inference should normally use raw inputs directly
That means Table is a training-oriented preprocessing container, not the preferred prediction input type.
Sklearn wrappers¶
ForestFire also exposes sklearn-compatible estimators on top of the Rust backend.
Import paths:
- from forestfire.tree import ...
- from forestfire.forest import ...
- from forestfire.gbm import ...
Examples:
from forestfire.tree import ObliviousRegressor
from forestfire.forest import CARTRandomForestRegressor
from forestfire.gbm import ExtraGBMRegressor
tree = ObliviousRegressor(max_depth=4).fit(X, y)
forest = CARTRandomForestRegressor(n_estimators=200).fit(X, y)
gbm = ExtraGBMRegressor(n_estimators=100, learning_rate=0.05).fit(X, y)
Available wrappers:
- forestfire.tree: ID3Classifier, C45Classifier, CARTClassifier, ExtraTreeClassifier, ObliviousTreeClassifier, CARTRegressor, ExtraTreeRegressor, ObliviousTreeRegressor
- forestfire.forest: ID3RandomForestClassifier, C45RandomForestClassifier, CARTRandomForestClassifier, ExtraRandomForestClassifier, ObliviousRandomForestClassifier, CARTRandomForestRegressor, ExtraRandomForestRegressor, ObliviousRandomForestRegressor
- forestfire.gbm: CARTGBMClassifier, ExtraGBMClassifier, ObliviousGBMClassifier, CARTGBMRegressor, ExtraGBMRegressor, ObliviousGBMRegressor
Sklearn wrapper semantics:
- they call the same Rust-backed train(...) API under the hood
- fit(...), predict(...), and classifier predict_proba(...) are supported
- fitted estimators expose model_
- classifiers expose classes_
- fitted estimators expose n_features_in_ when the input shape is available
- get_params(...) and set_params(...) are supported
- sample_weight is forwarded to the underlying train(...) call when provided
Wrapper defaults intentionally differ from raw train(...) in one place:
- sklearn wrappers default to canaries=0
That avoids canary-based early stopping surprising users on small sklearn-style toy datasets.
Missing values¶
ForestFire accepts the common missing-value representations that usually appear through those inputs:
- Python None
- floating-point NaN
- pandas/NumPy NaN
- polars null values
Training and prediction treat those values as missing rather than rejecting them.
The split semantics are:
- each feature reserves a separate missing bin
- split search ignores that bin when choosing observed thresholds or branch groupings
- the exact missing-row search behavior then depends on missing_value_strategy
- if a feature had no missing rows at a learned split, a later missing value falls back to the node prediction instead of pretending that the feature had seen a trained missing branch
Under missing_value_strategy="heuristic":
- choose the split from observed values first
- then decide whether missing rows should go left or right for that chosen split
Under missing_value_strategy="optimal":
- for each candidate split, evaluate both missing-left and missing-right
- keep the best joint combination of split and missing routing
That fallback is:
- majority class or node probabilities for classification
- node mean prediction for regression
Tables and input handling¶
Table¶
Table is the public container for validated training data. You can pass raw data directly to train(...), but building a Table explicitly is useful when you want preprocessing and validation separated from fitting.
Table chooses between:
- DenseTable for mixed numeric/binary data
- SparseTable for binary sparse inputs
Why Table exists at all:
- training wants one normalized, validated, binned representation
- all learners should see the same preprocessing contract
- canaries, auto binning, and sparse-vs-dense decisions belong in one place
Why Table is not the main inference abstraction:
- inference often starts from raw application data
- prediction should not require users to construct a training-oriented container first
- optimized runtimes now do their own lightweight projected preprocessing directly from raw inputs
DenseTable¶
DenseTable is Arrow-backed and optimized for repeated feature scans. Numeric features are rank-binned into a power-of-two number of bins, using the highest populated count up to 512 by default while keeping at least two rows per realized bin, and binary 0/1 columns are stored as booleans.
That combination is deliberate:
- Arrow arrays provide compact columnar storage
- power-of-two bins make later runtime layouts simpler
- forcing at least two rows per realized bin avoids wasting domain size on near-empty bins
- boolean storage keeps true binary columns cheap during both training and inference
Each dense feature also reserves one extra bin for missing values so split search can handle missingness without rebucketing the data at every node.
SparseTable¶
SparseTable is binary-only. Internally it stores, per feature, the row positions where the value is 1, so memory usage scales with the positive entries rather than the full dense shape.
This is useful because many sparse inputs are really presence/absence matrices. In that case the right abstraction is not “a giant mostly-zero dense matrix”; it is “which rows contain a positive value for each feature”.
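That layout can be sketched in pure Python (an illustration of the idea, not the actual Rust storage):

```python
def sparse_columns(rows):
    # Per feature, store the row positions where the binary value is 1.
    n_cols = len(rows[0]) if rows else 0
    return [
        [r for r, row in enumerate(rows) if row[c] == 1]
        for c in range(n_cols)
    ]
```

Memory then scales with positive entries: three rows of two mostly-zero features become two short position lists.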
Main model methods¶
- predict(...)
- predict_proba(...)
- optimize_inference(...)
- serialize(...)
- to_ir_json(...)
- to_dataframe(...)
Optimized inference¶
optimize_inference(...) returns an OptimizedModel that preserves model semantics while lowering execution into a runtime-oriented representation.
Python signature:
optimized = model.optimize_inference(
physical_cores=None,
missing_features=None,
)
Key runtime changes:
- CART-style binary trees use compact fallthrough/jump layouts
- multiway classifier splits use dense lookup tables
- oblivious trees use compact level arrays
- optimized models project inputs down to the features that actually appear in splits
- forests and boosted ensembles are lowered in a feature-locality-friendly tree order
- multi-row inputs are preprocessed together before scoring
- compiled binary and oblivious runtimes use compact column-major binned matrices
- row batches are scored in parallel across physical cores
The runtime pipeline is:
- inspect the semantic model
- compute used_feature_indices
- lower trees into runtime-friendly structures
- accept raw inference input
- validate it against the semantic schema
- preprocess only the projected feature subset
- score rows through scalar or batch-oriented execution
That is why optimized inference can reduce total latency even when the tree traversal itself is only moderately faster: it often avoids preprocessing columns that were never going to be used.
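The projection step itself is simple to picture (a one-line sketch; real optimized runtimes do this together with binning and schema validation):

```python
import numpy as np

def project_used_features(X, used_feature_indices):
    # Keep only the columns that appear in trained splits, so wide inputs
    # are never fully preprocessed.
    return np.asarray(X)[:, list(used_feature_indices)]
```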
Missing checks in optimized runtimes¶
By default, optimized runtimes preserve missing-aware inference for every used feature:
optimized = model.optimize_inference()
If you know that only some semantic feature indices may be missing at prediction time, pass them explicitly:
optimized = model.optimize_inference(missing_features=[0, 4, 9])
Semantics:
- missing_features=None: keep missing checks for every used feature
- missing_features=[...]: only optimized nodes that split on those semantic feature indices keep explicit missing handling
- missing_features=[]: omit missing checks entirely in the optimized runtime
This is a runtime-only optimization knob. Use it only when you control inference inputs and know which columns can actually be missing. Otherwise, the default is the safe choice. If an excluded feature later arrives missing, the optimized model will not execute the learned missing-specific branch for that split.
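The per-node decision reduces to a small predicate (a sketch of the documented semantics; the helper name is invented):

```python
def keeps_missing_check(split_feature_index, missing_features):
    # None -> keep missing checks for every used feature;
    # a list -> keep them only for nodes splitting on those indices.
    if missing_features is None:
        return True
    return split_feature_index in set(missing_features)
```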
Using runtime metadata¶
The most useful runtime inspection values are:
- model.used_feature_indices
- model.used_feature_count
- optimized.used_feature_indices
- optimized.used_feature_count
Example:
optimized = model.optimize_inference()
print(model.used_feature_indices)
print(optimized.used_feature_count)
Those values are semantic, not profiler-derived. They come from the trained splits in the model.
Optimized inference helps most on:
- large prediction batches
- deeper trees
- repeated scoring of the same model
- compiled binary trees
- wide inputs where the trained model only touches a small subset of columns
Compiled optimized artifacts¶
An optimized Python model can also be serialized into a compiled artifact:
optimized = model.optimize_inference()
payload = optimized.serialize_compiled()
restored = forestfire.OptimizedModel.deserialize_compiled(payload)
That artifact contains:
- the semantic IR
- the lowered runtime layout
- the feature projection metadata
This is useful when you want:
- faster reloads of the optimized runtime
- the same optimized execution strategy after deserialization
- one deployment artifact for repeated scoring
The semantic JSON serialization and the compiled optimized artifact solve different problems. The JSON form is the canonical model meaning; the compiled artifact is the cached execution form.
Feature importances¶
Both Model and OptimizedModel expose a feature_importances_ attribute:
importances = model.feature_importances_ # NDArray[float64], shape (n_features,)
This is Mean Decrease Impurity (MDI):
- for each split node, the gain is weighted by the fraction of training rows that passed through it
- for ensembles, the per-tree importances are averaged across all trees
- values are non-negative and sum to 1.0
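A sketch of that computation for a single tree. The `(feature, gain, node_fraction)` tuple shape here is hypothetical, chosen only to show the weighting and normalization:

```python
import numpy as np

def mdi_importances(splits, n_features):
    # Each split's gain is weighted by the fraction of training rows that
    # reached the node; totals are then normalized to sum to 1.
    imp = np.zeros(n_features)
    for feature, gain, node_fraction in splits:
        imp[feature] += gain * node_fraction
    total = imp.sum()
    return imp / total if total > 0 else imp
```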
Introspection¶
- tree_count
- tree_structure(...)
- tree_prediction_stats(...)
- tree_node(...)
- tree_level(...)
- tree_leaf(...)
to_dataframe(...) returns a polars.DataFrame when polars is installed and falls back to a pyarrow.Table otherwise.
Typical use cases:
- understanding realized tree shape after training
- inspecting cutoffs and leaf payloads
- summarizing leaf prediction distributions
- inspecting one tree at a time inside a forest or boosted ensemble