# Getting Started

## Python
```python
import numpy as np

from forestfire import train

X = np.array([[0.0], [0.0], [1.0], [1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

model = train(X, y, task="classification", tree_type="cart")
print(model.predict(X))
print(model.predict_proba(X))
```
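On this toy dataset a single CART-style split separates the classes. As an illustration of what such a split computes, here is a minimal plain-Python sketch of choosing a stump threshold by weighted Gini impurity; this is the underlying idea only, not ForestFire's implementation.

```python
import numpy as np

def best_stump_threshold(x, y):
    """Find the threshold minimizing weighted Gini impurity for a 1-D
    feature. Illustrative CART-style sketch, not ForestFire's code."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_t, best_score = None, np.inf
    for k in range(1, len(xs)):
        if xs[k] == xs[k - 1]:
            continue  # no valid split between equal values
        t = (xs[k] + xs[k - 1]) / 2  # midpoint candidate threshold
        score = 0.0
        for side in (ys[:k], ys[k:]):
            _, counts = np.unique(side, return_counts=True)
            p = counts / counts.sum()
            score += len(side) * (1.0 - np.sum(p ** 2))
        if score < best_score:
            best_t, best_score = t, score
    return best_t

x = np.array([0.0, 0.0, 1.0, 1.0])
y = np.array([0, 0, 1, 1])
print(best_stump_threshold(x, y))  # → 0.5
```

The single candidate threshold at 0.5 yields two pure children, so the stump already classifies the quickstart data perfectly.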
ForestFire also accepts common missing-value markers such as None,
np.nan, pandas/NumPy NaN, and polars nulls during both training and
prediction.
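As an illustration of what accepting mixed missing-value markers entails, the sketch below normalizes `None` and NaN cells to a single float NaN representation before data reaches a learner. The `normalize_missing` helper is hypothetical, not part of ForestFire's API.

```python
import numpy as np

def normalize_missing(rows):
    """Convert None and NaN markers in a nested list to np.nan in a
    float array. Hypothetical helper for illustration only."""
    arr = np.array(rows, dtype=object)
    mask = np.array(
        [v is None or (isinstance(v, float) and np.isnan(v)) for v in arr.ravel()]
    ).reshape(arr.shape)
    # Replace every flagged cell with np.nan, then cast to float.
    return np.where(mask, np.nan, arr).astype(float)

X = normalize_missing([[0.0, None], [np.nan, 1.0]])
print(np.isnan(X))  # both None and np.nan cells are flagged as missing
```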
Install:

```bash
pip install forestfire-ml
```

Import:

```python
import forestfire
```
## Rust
```rust
use forestfire_core::{train, Criterion, Task, TrainAlgorithm, TrainConfig, TreeType};
use forestfire_data::Table;

// Note: the `?` operator requires an enclosing function that returns a Result.
let x = vec![vec![0.0], vec![0.0], vec![1.0], vec![1.0]];
let y = vec![0.0, 0.0, 1.0, 1.0];
let table = Table::new(x, y)?;
let model = train(
    &table,
    TrainConfig {
        algorithm: TrainAlgorithm::Dt,
        task: Task::Classification,
        tree_type: TreeType::Cart,
        criterion: Criterion::Gini,
        ..TrainConfig::default()
    },
)?;
```
## Local development

```bash
task setup-local-env
task python-ext-develop
```

Useful tasks:

- `task test`
- `task verify`
- `task rust-verify`
- `task docs-serve`
- `task docs-build`
Verification note:

- `task verify` includes the Python extension build path, not just Rust checks.
- If your environment is offline and the needed Python wheels are not already cached, `task verify` can fail during `task python-ext-develop`, because `maturin develop` may need to install dependencies such as `numpy`.
## How to think about the API

The intended user flow is:

- give the library a feature matrix and target
- let `Table` decide how to represent the data
- call the unified `train(...)` entrypoint
- use `predict(...)` on raw inference data rather than rebuilding a training table
- optionally inspect `used_feature_indices` to see what the trained model actually depends on
- optionally call `optimize_inference(...)` when scoring is performance-critical
- serialize the semantic model or snapshot the optimized runtime, depending on deployment needs
That is why the API is organized around `Table`, `train`, `predict`, `optimize_inference`, and `serialize`, rather than around many learner-specific classes.
The most important conceptual distinction is:

- `Table` is primarily for training-time normalization and preprocessing
- inference should usually consume raw user-facing inputs directly
- optimized inference derives its own projected runtime representation from the trained model
One more behavioral detail matters in practice:
- missing values are assigned to a separate training bin per feature
- candidate splits are chosen from the non-missing bins
- after the best observed split is found, the learner evaluates routing missing rows left vs right and stores the better choice
- if a split feature had no missing values at training time, a later missing value falls back to the node prediction: the majority class for classification and the node mean for regression
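The left-vs-right routing step above can be sketched in plain Python. This is an illustrative reimplementation assuming a Gini criterion for a binary classification split, not ForestFire's actual code.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array (0.0 for an empty or pure node)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def choose_missing_direction(x, y, threshold):
    """Given a numeric split on feature values x (NaN marks missing),
    compare sending the missing rows left vs right and return the
    direction with the lower weighted impurity. Illustrative sketch."""
    missing = np.isnan(x)
    left = (~missing) & (x <= threshold)
    right = (~missing) & (x > threshold)
    none = np.zeros_like(missing)  # all-False mask
    best = None
    for direction in ("left", "right"):
        l = y[left | (missing if direction == "left" else none)]
        r = y[right | (missing if direction == "right" else none)]
        score = (len(l) * gini(l) + len(r) * gini(r)) / len(y)
        if best is None or score < best[1]:
            best = (direction, score)
    return best[0]

x = np.array([0.0, 0.2, np.nan, 1.0, 0.9])
y = np.array([0, 0, 0, 1, 1])
print(choose_missing_direction(x, y, 0.5))  # → "left"
```

Here the missing row belongs to class 0, so routing it left keeps both children pure and "left" is stored as the learned direction.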
If you enable `split_strategy="oblique"` for `cart` or `randomized` trees, the split itself becomes a sparse two-feature linear test, and missing routing is learned independently for each participating feature.
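A sketch of how such a sparse two-feature linear test with per-feature missing routing might evaluate a single row (all names and parameters here are illustrative assumptions, not ForestFire's API):

```python
import numpy as np

def oblique_route(row, i, j, wi, wj, threshold,
                  missing_left_i=True, missing_left_j=True):
    """Route a row through a sparse two-feature oblique split.

    The test is wi * row[i] + wj * row[j] <= threshold. If a
    participating feature is missing (NaN), the direction learned for
    that feature at training time is used instead. Illustrative
    sketch only, not ForestFire's implementation."""
    xi, xj = row[i], row[j]
    if np.isnan(xi):
        return "left" if missing_left_i else "right"
    if np.isnan(xj):
        return "left" if missing_left_j else "right"
    return "left" if wi * xi + wj * xj <= threshold else "right"

# Feature 1 is missing but unused by this split, so the linear test applies.
row = np.array([0.5, np.nan, 2.0])
print(oblique_route(row, i=0, j=2, wi=1.0, wj=-0.5, threshold=0.0))  # → "left"
```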