Training¶
Algorithms¶
- `dt`: one tree
- `rf`: random forest
- `gbm`: gradient boosting
These are separate algorithms, not just cosmetic wrappers around the same tree builder:
- `dt` optimizes a single tree directly and is useful when interpretability matters most
- `rf` reduces variance by averaging many independently sampled trees
- `gbm` fits trees stage-by-stage to residual structure and is meant to trade interpretability for stronger predictive performance
Support matrix¶
- `algorithm="dt"`
    - regression: `cart`, `randomized`, `oblivious`
    - classification: `id3`, `c45`, `cart`, `randomized`, `oblivious`
- `algorithm="rf"`
    - regression: `cart`, `randomized`, `oblivious`
    - classification: `id3`, `c45`, `cart`, `randomized`, `oblivious`
- `algorithm="gbm"`
    - regression: `cart`, `randomized`, `oblivious`
    - classification: `cart`, `randomized`, `oblivious` with binary targets only
The support matrix is narrower for boosting because boosting needs more than a split criterion and a leaf value. It needs consistent gradient/Hessian handling, additive stage updates, and stable leaf semantics under repeated residual fitting. That is why the current boosting support is focused on the binary-split regression-style tree families that map cleanly onto second-order stage training.
Tree types¶
Classification:
- `id3`
- `c45`
- `cart`
- `randomized`
- `oblivious`
Regression:
- `cart`
- `randomized`
- `oblivious`
Why multiple tree types exist:
- `id3` and `c45` keep the classic multiway classifier style and are useful when you want a direct, explicit branch-per-bin structure
- `cart` is the general-purpose binary-tree baseline and the most natural fit for forests and boosting
- `randomized` introduces stochasticity into split search, which is useful both as a standalone learner and as the basis for extra-trees-style ensembles
- `oblivious` enforces one split per depth across the tree, which makes the final model more regular and easier to lower into compact runtime layouts
Split strategies¶
ForestFire separates the tree family from the split family.
The public split strategies are:
- `axis_aligned`
- `oblique`

`axis_aligned` is the default and means ordinary one-feature threshold splits.
oblique means a sparse pairwise linear split:
w1 * x_i + w2 * x_j <= t
Current support matrix:
- `axis_aligned`: all supported tree families
- `oblique`: `dt`, `rf`, and `gbm` when `tree_type` is `cart` or `randomized`
Not currently supported for:
- `id3`
- `c45`
- `oblivious`
Current implementation details:
- oblique nodes are currently pairwise, not arbitrary k-feature hyperplanes
- all candidate feature pairs available at a node are considered
- axis-aligned and oblique candidates compete inside the same canary-filtered ranking
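As a concrete sketch (illustrative Python, not the library's internals), a pairwise oblique node routes a row like this:

```python
def route_oblique(x_i, x_j, w1, w2, t):
    # sparse pairwise linear split: w1 * x_i + w2 * x_j <= t goes left
    return "left" if w1 * x_i + w2 * x_j <= t else "right"

# a diagonal boundary (x_0 - x_1 <= 0) separates these rows even though
# the left/right x_0 values interleave (1, 3 vs 2, 4), so no single
# axis-aligned threshold on either feature alone could do it
rows = [(1.0, 2.0), (3.0, 4.0), (2.0, 1.0), (4.0, 3.0)]
sides = [route_oblique(a, b, w1=1.0, w2=-1.0, t=0.0) for a, b in rows]
```

The interleaving is the point: oblique candidates buy expressiveness exactly where single-feature thresholds cannot draw the boundary.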
Builders¶
ForestFire also separates tree family from tree-construction strategy.
The public builders are:
- `greedy`
- `lookahead`
- `beam`
- `optimal`
At a high level:
- `greedy` ranks candidates by immediate node-local score only
- `lookahead` uses a bounded version of the same recursive subtree scorer as `optimal`, limited by `lookahead_depth`, `lookahead_top_k`, and `lookahead_weight`
- `beam` uses that same bounded scorer, but keeps up to `beam_width` continuations alive during future search before taking the strongest surviving continuation value
- `optimal` recursively evaluates all legal cuts until ordinary stopping rules or canary blocking end that branch
The builder controls how a node decides which split to take. It does not change:
- the split family itself
- the missing-value semantics of the learner
- the leaf payload semantics
Related parameters:
- `lookahead_depth`
- `lookahead_top_k`
- `lookahead_weight`
- `beam_width`

Those tuning knobs apply to `lookahead` and `beam`. `optimal` ignores them.
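The difference between greedy and lookahead-style ranking can be sketched with hypothetical candidate scores (the numbers and helper names below are illustrative, not the real scorer):

```python
# hypothetical per-candidate scores: (immediate_gain, best_future_gain)
candidates = {"f0": (5.0, 0.0), "f1": (3.0, 6.0)}

def greedy_pick(cands):
    # greedy ranks by immediate node-local score only
    return max(cands, key=lambda name: cands[name][0])

def lookahead_pick(cands, lookahead_weight=0.5):
    # bounded future score blended in via lookahead_weight
    return max(cands, key=lambda name: cands[name][0]
               + lookahead_weight * cands[name][1])
```

With these numbers, `greedy_pick` takes `"f0"` on immediate gain alone, while `lookahead_pick` prefers `"f1"` because its discounted future gain outweighs the weaker immediate score.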
For the detailed algorithmic behavior, see:
Task detection¶
With task="auto":
- integer, boolean, and string targets become classification
- float targets become regression
This keeps the common path ergonomic without hiding the important distinction that task choice changes:
- the split objective
- the leaf payload
- the valid tree families
- the prediction API surface, especially `predict_proba(...)`
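A minimal sketch of the detection rule (illustrative Python; `infer_task` is not part of the public API):

```python
def infer_task(y):
    """Sketch of the task="auto" rule for plain Python targets."""
    if all(isinstance(v, (bool, int, str)) for v in y):
        return "classification"  # integer, boolean, and string targets
    if all(isinstance(v, float) for v in y):
        return "regression"      # float targets
    raise TypeError("mixed target types are ambiguous in this sketch")
```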
Missing values during training¶
ForestFire treats missing values as first-class training inputs rather than as rows that must be dropped beforehand.
Accepted missing markers include:

- Python `None`
- floating-point `NaN`
- pandas/NumPy `NaN`
- polars `null` values
The training contract is:
- every feature gets a dedicated missing bin
- observed split search ignores that missing bin
- missing rows are then routed according to the configured `missing_value_strategy`
The public strategies are:
- `"heuristic"`: choose the best split from observed values first, then decide whether missing rows should go left or right for that split
- `"optimal"`: for every candidate split, evaluate missing-left and missing-right, then choose the best full combination
- per-column dictionary: assign `"heuristic"` or `"optimal"` feature by feature, with unspecified features defaulting to `"heuristic"`
Python examples:
train(X, y, missing_value_strategy="heuristic")
train(X, y, missing_value_strategy="optimal")
train(X, y, missing_value_strategy={"col_1": "optimal", "col_2": "heuristic"})
train(X, y, missing_value_strategy={"f0": "optimal", "f1": "heuristic"})
The feature-name forms are aliases for semantic column indices:
- `"col_1"` and `"f0"` both mean feature index `0`
- `"col_2"` and `"f1"` both mean feature index `1`
Why both strategies exist:
- `"heuristic"` is the practical default and keeps training cost closer to ordinary split search
- `"optimal"` can be much slower because it expands the search to all candidate split and missing-routing combinations
That design matters because it separates two questions that are often muddled together:
- what is the best split among observed values?
- given that split, where should the missing rows go?
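The two-phase versus joint search can be contrasted in a small sketch with hypothetical gain numbers (not the library's internals):

```python
# hypothetical scored candidates: observed-split gain plus the extra
# gain each missing-routing direction would add for that split
splits = {
    "f0": {"observed_gain": 5.0, "missing_gain": {"left": 0.0, "right": 1.0}},
    "f1": {"observed_gain": 4.0, "missing_gain": {"left": 3.0, "right": 0.0}},
}

def best_split_heuristic(splits):
    # phase 1: best split on observed values only
    feat = max(splits, key=lambda f: splits[f]["observed_gain"])
    # phase 2: then decide missing routing for that split alone
    side = max(("left", "right"), key=lambda d: splits[feat]["missing_gain"][d])
    return feat, side

def best_split_optimal(splits):
    # evaluate every (split, missing-direction) combination jointly
    combos = [(f, d, s["observed_gain"] + s["missing_gain"][d])
              for f, s in splits.items() for d in ("left", "right")]
    best = max(combos, key=lambda c: c[2])
    return best[0], best[1]
```

With these numbers the two strategies genuinely disagree: the heuristic commits to `f0` before considering missing routing, while the joint search finds that `f1` with missing-left scores higher overall.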
When a node never saw missing values for its split feature during training, the model does not invent a learned missing branch. A later missing value at prediction time falls back to the node prediction instead:
- majority class for classification
- node mean for regression
Current implementation note:
- the strategy toggle is implemented for the standard first-order tree builders
- the second-order boosting path uses the same learned missing-routing semantics, but it does not expose a separate heuristic-vs-optimal toggle
Categorical features¶
Categorical handling uses the same public train(...) interface as numeric
training. There is no separate categorical training API.
The current strategies are:
- `dummy`
- `target`
- `fisher`
At a high level:
- `dummy` expands nominal categories into indicator-style derived columns
- `target` replaces categories with smoothed target-derived statistics
- `fisher` orders categories by target-derived statistics and then lets the tree learner split over that learned ordering
The implementation is split intentionally:
- `dummy` and `target` are table-layer transforms
- `fisher` is implemented in core because it is tied to category ordering for threshold-based split search
From the trainer’s point of view, categorical handling happens before tree growth and produces the numeric/binary representation consumed by the existing split machinery. That means:
- trees, forests, and GBM all use the same categorical entry path
- canaries compete against the transformed features rather than raw category labels
- histogram-based split search works over the transformed representation
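The idea behind `fisher` can be sketched as an illustrative Python reduction (not the core implementation, which also handles smoothing and missing categories):

```python
from collections import defaultdict

def fisher_order(categories, targets):
    """Map categories to ranks in target-mean order.

    After this mapping, an ordinary threshold split over ranks can
    express any contiguous grouping of categories in target order.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1
    ordered = sorted(sums, key=lambda cat: sums[cat] / counts[cat])
    rank = {cat: i for i, cat in enumerate(ordered)}
    return [rank[cat] for cat in categories]
```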
Example:
train(
X,
y,
algorithm="gbm",
task="classification",
tree_type="cart",
categorical_strategy="fisher",
)
Categorical models now preserve that transform contract in:
- semantic serialization
- IR export
- compiled optimized artifacts
So restored categorical models still accept raw categorical inputs rather than requiring callers to reproduce the transform manually.
Sample weights¶
ForestFire accepts per-row training weights via `sample_weight`.

- `None` (default): all rows contribute equally
- 1-D numeric array of length `n_rows`: rows are weighted by those values
For regression, split scoring uses weighted MSE: Σ w_i (y_i - ȳ)², where ȳ is the weighted node mean.
For gradient boosting, weights scale the per-row gradient and Hessian so high-weight rows drive larger update steps.
Weights are propagated through bootstrap sampling in rf and through gradient-focus row sampling in gbm, so ensemble methods respect weights correctly.
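The weighted objective can be sketched directly (illustrative Python, not the library's scorer):

```python
def weighted_mse(y, w):
    # weighted node objective: sum of w_i * (y_i - weighted mean)^2
    mean = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, y))
```

One useful sanity property of this form: giving a row weight 2 scores the same as duplicating that row with weight 1.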
Multi-target regression¶
When y is a 2-D array of shape (n_rows, n_targets) with n_targets > 1
and task="regression", a single tree is trained to predict all targets
jointly.
- splits are chosen by maximising the sum of MSE gain across all targets, evaluated at a shared threshold per feature so score and applied split are always consistent
- each leaf stores one predicted value per target
- `predict(...)` returns a 2-D array of shape `(n_rows, n_targets)`
- compatible with `sample_weight`
Multi-target training uses `algorithm="dt"`. Multi-target support for `rf` and `gbm` is not currently available.
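The shared-threshold gain rule can be sketched like this (illustrative Python; `multi_target_gain` is a hypothetical helper, not the public API):

```python
def sse(values):
    # sum of squared errors around the mean
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def multi_target_gain(y_left, y_right, y_parent):
    # each row is a tuple with one value per target; the gain for one
    # shared threshold is the sum of per-target SSE reductions
    n_targets = len(y_parent[0])
    gain = 0.0
    for t in range(n_targets):
        gain += (sse([row[t] for row in y_parent])
                 - sse([row[t] for row in y_left])
                 - sse([row[t] for row in y_right]))
    return gain
```

Because every target is evaluated at the same threshold, the gain that ranked the split is exactly the gain realized when the split is applied.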
Stopping and control parameters¶
- `max_depth`
- `min_samples_split`
- `min_samples_leaf`
- `max_features`
- `seed`
These parameters materially affect learning across the supported tree families, including forest constituent trees.
The rationale behind these controls is straightforward:
- `max_depth` limits structural complexity directly and is the cleanest way to cap traversal cost
- `min_samples_split` prevents very small unstable internal nodes
- `min_samples_leaf` prevents highly confident leaves built from tiny support
- `max_features` controls correlation and search width, which matters much more in ensembles than in a single tree
- `seed` is there because the library deliberately uses randomness in several places: bootstrapping, feature subsampling, randomized splits, and boosting row sampling
The intended semantics of `seed` are:
- fixed seed -> repeatable training for the same dataset and configuration
- different seed -> legitimately different stochastic choices
That randomness is not left ad hoc inside each learner. ForestFire now uses a shared seed-mixing path for:
- forest tree seeds
- boosting stage seeds
- node-local feature-subset sampling
- randomized threshold selection
One practical detail matters here: node-local randomization is derived from the rows owned by the node, but not from the incidental order of the mutable row-index buffer. That makes the stochastic behavior more stable under internal partitioning details while remaining sensitive to the actual training data seen at each node.
Canaries¶
ForestFire uses canary features to stop growth during training.
For the full rationale and algorithm-specific behavior, see Canary Strategy.
Behavior:
- standard trees stop at the current node if no acceptable real split survives canary competition
- oblivious trees stop further depth growth if no acceptable real level-split survives canary competition
- gradient boosting stops adding stages if no acceptable real root split survives canary competition
- random forests ignore canaries during tree training
By default, that competition is strict:
- `filter=None` behaves like `filter=1`
- the top-ranked candidate must already be a real feature
You can soften that rule with `filter=...`:

- `filter=3` means the chosen real split must appear within the top 3 scored candidates
- `filter=0.95` means the chosen real split must appear within the top 5% of scored candidates
In both cases, candidates are still scored and sorted in the usual way first. The difference is only which ranked window is allowed to contain the chosen real feature.
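The window check can be sketched as follows. Only the intent (top-N versus top-fraction windows over the sorted candidates) comes from the description above; the exact rounding of the fractional window is an assumption of this sketch:

```python
def real_split_allowed(ranked_is_canary, filter=None):
    """ranked_is_canary: best-first flags, True meaning a canary candidate.

    filter=None behaves like filter=1; an int keeps a top-N window, a
    float like 0.95 keeps the top (1 - filter) fraction of candidates.
    """
    if filter is None:
        filter = 1
    if isinstance(filter, int):
        window = filter
    else:
        window = max(1, int(len(ranked_is_canary) * (1 - filter)))
    # the node may split only if some real feature sits inside the window
    return any(not is_canary for is_canary in ranked_is_canary[:window])
```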
Why canaries exist:
- standard tree learners can always continue to fit noise if the stopping rule only looks at local impurity decrease
- a canary feature is a shuffled copy of real signal space, so if the learner prefers it, the current split quality is no better than structured noise
- this turns stopping into a training-time competition between real and noise-like features instead of a later pruning pass
Why random forests ignore them:
- forests already reduce overfitting primarily through bootstrapping and feature subsampling
- reintroducing canaries inside each tree would often stop growth for the wrong reason, because each constituent tree already sees a deliberately restricted view of the data
- in boosting, the opposite tradeoff is useful: canaries remain active because stage-wise residual fitting is much more willing to chase noise late in training
Random forests¶
- bootstrap sampling per tree
- feature subsampling per node
- optional OOB score computation
The random-forest implementation is intentionally close to the classical idea:
- each tree trains on a bootstrap sample
- each node considers only a sampled feature subset
- predictions are aggregated by averaging leaf outputs or probabilities
The important design choice is that preprocessing is still shared. Trees do not each rebucket the raw data. They all operate over the same preprocessed representation, which keeps training semantics aligned and avoids duplicated preprocessing work.
Gradient boosting¶
- second-order trees trained from gradients and Hessians
- `learning_rate`
- optional stage bootstrapping
- LightGBM-style gradient-focused row sampling via `top_gradient_fraction` and `other_gradient_fraction`
The boosting implementation is LightGBM-like in spirit, but not a clone.
It keeps the ideas that matter most:
- second-order tree fitting from gradients and Hessians
- shrinkage via `learning_rate`
- optional row sampling
- focusing stage fitting on high-gradient rows
But it keeps a ForestFire-specific stopping rule:
- canaries still participate in split selection
- if no real root split survives inside the allowed canary `filter` window, that stage is discarded and boosting stops
- the first boosting stage can optionally retry with `boosting_first_stage_retry_filter`
- that retry filter defaults to `TopN(1)`, so it preserves the strict stop unless you widen it explicitly
That choice reflects the project’s general design preference: use explicit training-time noise competition as a stopping signal instead of bolting on a separate pruning or early-stopping layer later.
How second-order tree fitting works¶
The boosting stage loop is serial, but the tree builder inside one stage is now organized so it can become more parallel internally.
That distinction matters:
- stage `k + 1` depends on the predictions produced after stage `k`
- the nodes inside the single tree fitted at stage `k` do not all depend on one another in the same way
ForestFire therefore treats "fit one boosting stage" and "grow one tree inside that stage" as separate scheduling problems.
At a high level, one boosting stage does this:
- compute per-row gradients and Hessians from the current ensemble prediction
- fit one second-order tree against those gradient/Hessian pairs
- scale that tree by `learning_rate`
- add it to the ensemble
- repeat until the stopping rule rejects the next stage
The second-order tree itself uses Newton-style leaf values. For any node, the leaf prediction is derived from the node-local gradient and Hessian totals:
leaf_value = -G / (H + lambda)
where:
- `G` is the sum of gradients in the node
- `H` is the sum of Hessians in the node
- `lambda` is the L2 regularization term
The split score asks whether dividing the current rows into left and right children increases the regularized objective strength enough to justify growing the tree.
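A worked sketch of the leaf value, together with the second-order gain form it implies. The gain expression below is the standard XGBoost-style formula, assumed here for illustration; only the leaf-value formula comes from the text above:

```python
def objective_strength(g_sum, h_sum, lam=1.0):
    # standard second-order objective strength: G^2 / (H + lambda)
    return g_sum * g_sum / (h_sum + lam)

def leaf_value(g_sum, h_sum, lam=1.0):
    # Newton-style leaf value from node gradient/Hessian totals
    return -g_sum / (h_sum + lam)

def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    # children strength minus parent strength; positive values mean the
    # split strengthens the regularized objective
    return (objective_strength(g_left, h_left, lam)
            + objective_strength(g_right, h_right, lam)
            - objective_strength(g_left + g_right, h_left + h_right, lam))
```

For example, a node with `G = -4` and `H = 3` under `lambda = 1` predicts `-(-4) / (3 + 1) = 1.0`.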
The active-node frontier¶
CART and randomized trees — regression, classification, and gradient boosting alike — now grow with an active-node frontier instead of depth-first recursion.
The frontier is simply the set of nodes at the current depth that are still eligible to split.
That means tree growth now looks like this:
- start with the root in the frontier
- evaluate every node in that frontier
- decide which nodes actually split
- partition rows for those winning splits
- create all children
- use those children as the next frontier
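The loop above can be sketched as follows (illustrative Python; `choose_split` and `partition` stand in for the real evaluation and row-buffer machinery):

```python
def grow_level_wise(root_rows, choose_split, partition, max_depth):
    """Frontier-based growth: evaluate a whole depth, then partition,
    then form the next frontier."""
    frontier = [root_rows]
    for _ in range(max_depth):
        # phase 1: evaluate every frontier node (no row mutation yet)
        decisions = [choose_split(node) for node in frontier]
        # phases 2-3: partition the winners, collect their children
        next_frontier = []
        for node, split in zip(frontier, decisions):
            if split is None:
                continue  # node becomes a leaf
            left, right = partition(node, split)
            next_frontier.extend([left, right])
        if not next_frontier:
            break
        frontier = next_frontier
    return frontier

# toy drivers: split any node with >= 2 rows at its midpoint
choose = lambda rows: len(rows) // 2 if len(rows) >= 2 else None
halve = lambda rows, k: (rows[:k], rows[k:])
leaves = grow_level_wise(list(range(4)), choose, halve, max_depth=2)
```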
This is different from the older depth-first pattern:
- evaluate one node
- partition its rows immediately
- recurse into the left child
- recurse into the right child
- only later return to same-depth siblings
The frontier-based layout is important because it separates node evaluation from row-buffer mutation.
In the standard binary tree path, ForestFire still uses one shared mutable row-index buffer rather than copying rows into separate node-local allocations. Each active node owns a contiguous slice of that buffer. During evaluation, the builder reads that slice, computes node statistics, constructs or reuses histograms, and scores candidate features. It does not mutate the slice yet.
Only after the whole current frontier has been evaluated does the builder begin partitioning rows for the nodes that actually split.
That separation is the architectural point of the frontier:
- evaluation is "read the current node state and score split candidates"
- partitioning is "rewrite the shared row-index slice so left/right children occupy contiguous ranges"
- child creation is "record those new ranges as the next active frontier"
The standard CART/randomized path now uses that boundary in three explicit phases for each depth:
- evaluate the current frontier in parallel
- partition row ranges for the nodes that actually split in parallel across disjoint row-buffer slices
- build child histograms for the next frontier in parallel
That means the frontier is no longer only a structural preparation for future parallelism. It is now the execution model used by all standard CART/randomized builders: regression, classification, and gradient boosting.
What is evaluated for each active node¶
For every active node, the builder computes:
- sample count
- gradient sum
- Hessian sum
- the node leaf prediction
- the node objective strength before splitting
- feature histograms for the node rows
- the best surviving split candidate after canary filtering
The stopping checks happen at this evaluation stage:
- max depth
- `min_samples_split`
- non-positive or too-small Hessian mass
- no valid feature split after `min_samples_leaf` and `min_sum_hessian_in_leaf`
- canary competition blocking the node at the root
If a node does not produce a valid split, it becomes a leaf and never enters the next frontier.
If it does split, the builder retains enough information to mutate the row buffer afterward without rescoring the node.
When training parallelism is enabled, that evaluation phase runs across the whole active frontier at once. Each node still scores its own candidate features using the same histogram-based split logic as before, but same-depth nodes can now do that work concurrently.
The later mutation phase is now parallel too. Pending same-depth splits are sorted by their owned row ranges, then partitioned through a recursive divide-and-conquer pass over disjoint slices of the shared row-index buffer. That keeps the in-place row-buffer model intact while allowing non-overlapping partitions to run concurrently.
Oblique splitting now participates in that same frontier flow for standard `cart` and `randomized` trees across all learner families.
In that mode:
- axis-aligned and oblique candidates are ranked together at each active node
- oblique candidates are still sparse pairwise linear splits
- all candidate feature pairs available at the node are considered
- per-feature missing routing is learned for the two participating features and replayed at inference time
Histograms and child reuse¶
The second-order tree path is histogram-based.
For each feature and node, the histogram stores:
- row count
- gradient sum
- Hessian sum
Those histograms are what let the builder score candidate thresholds without rescanning raw rows for every possible split.
When a node is partitioned, the implementation tries to avoid rebuilding both children from scratch:
- it builds histograms for the smaller child directly from that child’s rows
- it derives the sibling histograms by subtracting the smaller child from the parent histograms
That matters because histogram construction is one of the dominant training costs. Reusing the parent histogram avoids paying that full cost twice per split.
That reuse now happens inside the frontier pipeline as well. After partitioning has produced the child row ranges for every winning split at the current depth, the builder constructs the next frontier’s histograms in parallel across those splitting nodes.
Why the frontier matters for parallelism¶
The frontier provides both the execution boundary and the actual parallel work for all standard CART/randomized trees.
Without a frontier, same-depth siblings are entangled with mutation order:
- node `A` is evaluated
- node `A` partitions the shared row buffer
- node `A` recurses into its children
- only later does node `B` get evaluated
That ordering makes same-depth batching awkward because one branch has already started rewriting shared training state before its siblings have even been scored.
With a frontier, every node at a depth is evaluated against the same stable view of the current row ownership. That makes several future steps much cleaner:
- scoring active nodes in parallel
- feature-parallel split search inside each node evaluation
- partitioning disjoint same-depth row ranges in parallel once split choices are known
- building child histograms across splitting nodes in parallel after partitioning
- postponing row partitioning until after split selection is complete
For gradient boosting, the stage loop still remains serial. The frontier only changes the work inside one stage’s tree fit.
That is the intended balance:
- keep stage semantics simple and correct
- make intra-tree work increasingly parallel over time
Relationship to oblivious trees¶
Oblivious second-order trees were already level-wise by construction because every node at a given depth shares the same feature/threshold pair.
The active-node frontier brings all standard CART/randomized paths closer to that same batching style, but without changing the standard-tree semantics:
- standard trees still choose splits independently per node
- oblivious trees still choose one shared split per depth
What they share is the key scheduling idea that depth-level work can be evaluated as a batch before the next level is created.
Training optimizations¶
ForestFire training is optimized around a compact binned core and shared row-index buffers rather than row copies.
- numeric features are pre-binned into compact integer ranks, capped at `512` bins
- each feature reserves one extra missing bin alongside the observed bins
- `bins="auto"` chooses the highest populated power-of-two count per feature while keeping at least two rows in every realized bin
- `histogram_bins=...` can override the numeric resolution used during split search without requiring callers to rebuild the source `Table`
- long-running training and prediction release the Python GIL before entering the Rust hot path
- CART and randomized trees use histogram-based numeric split search
- standard binary trees partition row indices in place
- ID3 and C4.5 use the same in-place row-buffer approach for multiway branches
- oblivious trees train over shared row buffers plus per-leaf ranges
- CART/randomized classification and mean-regression builders reuse parent histograms and derive sibling histograms by subtraction
- oblivious split scoring reuses cached per-leaf counts or `sum`/`sum_sq`
- random forests parallelize across trees while limiting intra-tree parallelism
- all standard CART/randomized trees (regression, classification, and gradient boosting) now grow through a level-wise active-node frontier, with parallel frontier evaluation, feature-parallel split scoring, parallel row partitioning over disjoint slices, and parallel child-histogram construction separated by explicit frontier phases
- binary sparse inputs stay sparse through training and inference
- classifier, regressor, and second-order tree builders now share the same core histogram/partitioning/randomization helpers instead of carrying separate implementations of the same mechanics
Reading the snippets below¶
As in the runtime page, the snippets here are representative shapes rather than exact copies of every production function.
They are meant to show:
- the straightforward naïve way to structure the work
- the shape ForestFire actually optimizes for
- why the optimized shape is cheaper in memory traffic, allocation, or repeated computation
The real implementations live across crates/data/src and crates/core/src/tree, but these snippets are the shortest path to understanding the architecture.
1. Binning once instead of rescanning raw floats at every node¶
Naïve approach¶
fn best_threshold_naive(values: &[f64], rows: &[usize]) -> Option<f64> {
let mut best = None;
for &candidate_row in rows {
let threshold = values[candidate_row];
let (left, right): (Vec<_>, Vec<_>) = rows
.iter()
.copied()
.partition(|&row| values[row] <= threshold);
let score = score_split_from_raw_values(values, &left, &right);
best = pick_better(best, (threshold, score));
}
best.map(|(threshold, _)| threshold)
}
This rescans raw floating-point values and repartitions rows for every candidate threshold.
ForestFire approach¶
fn best_threshold_binned(
binned_values: &[u16],
histogram: &[HistogramBin],
) -> Option<u16> {
let mut best = None;
let mut left = HistogramBin::default();
let total = histogram_total(histogram);
for (bin, bucket) in histogram.iter().enumerate() {
left = left + bucket;
let right = total - left;
let score = score_split_from_histograms(&left, &right);
best = pick_better(best, (bin as u16, score));
}
best.map(|(bin, _)| bin)
}
Why this helps:
- candidate space is bounded by the realized bins
- split scoring works from aggregated counts or sums rather than rescanning rows
- the same binned representation is reused by every learner family and later by optimized inference
2. In-place row partitioning instead of copying child row vectors¶
Naïve approach¶
fn split_rows_naive(rows: &[usize], feature: usize, threshold: u16, table: &dyn TableAccess) -> (Vec<usize>, Vec<usize>) {
rows.iter()
.copied()
.partition(|row| table.binned_value(feature, *row) <= threshold)
}
This allocates new child vectors at every split.
ForestFire approach¶
fn split_rows_in_place(
rows: &mut [usize],
feature: usize,
threshold: u16,
table: &dyn TableAccess,
) -> usize {
let mut boundary = 0usize;
for idx in 0..rows.len() {
if table.binned_value(feature, rows[idx]) <= threshold {
rows.swap(boundary, idx);
boundary += 1;
}
}
boundary
}
Now one mutable row-index buffer is reused throughout growth:
- `rows[..boundary]` is the left child
- `rows[boundary..]` is the right child
Why this helps:
- far fewer allocations
- better cache locality on the row-index buffer
- easier reuse of the same data structure across CART, randomized, and second-order trees
3. Histogram subtraction instead of rebuilding sibling statistics¶
Naïve approach¶
fn child_histograms_naive(
left_rows: &[usize],
right_rows: &[usize],
table: &dyn TableAccess,
) -> (Vec<HistogramBin>, Vec<HistogramBin>) {
(
build_histograms_from_rows(left_rows, table),
build_histograms_from_rows(right_rows, table),
)
}
This does full work for both children even though the parent histogram is already known.
ForestFire approach¶
fn child_histograms_subtractive(
parent: &[HistogramBin],
left: &[HistogramBin],
) -> (Vec<HistogramBin>, Vec<HistogramBin>) {
let right = parent
.iter()
.zip(left.iter())
.map(|(parent_bin, left_bin)| parent_bin.subtract(left_bin))
.collect::<Vec<_>>();
(left.to_vec(), right)
}
Why this helps:
- one child histogram can be built directly
- the sibling becomes derived work instead of independent work
- the savings compound at every internal node in binary tree growth
4. Shared row buffers for oblivious training¶
Naïve approach¶
fn grow_oblivious_naive(leaves: &[Vec<usize>], split: Split, table: &dyn TableAccess) -> Vec<Vec<usize>> {
let mut next = Vec::new();
for leaf_rows in leaves {
let (left, right) = leaf_rows
.iter()
.copied()
.partition(|row| table.binned_value(split.feature_index, *row) <= split.threshold_bin);
next.push(left);
next.push(right);
}
next
}
This keeps allocating fresh vectors per leaf at every depth.
ForestFire approach¶
struct LeafRange {
start: usize,
end: usize,
}
fn grow_oblivious_in_place(
rows: &mut [usize],
leaves: &[LeafRange],
split: Split,
table: &dyn TableAccess,
) -> Vec<LeafRange> {
let mut next = Vec::with_capacity(leaves.len() * 2);
for leaf in leaves {
let boundary = split_rows_in_place(
&mut rows[leaf.start..leaf.end],
split.feature_index,
split.threshold_bin,
table,
);
next.push(LeafRange {
start: leaf.start,
end: leaf.start + boundary,
});
next.push(LeafRange {
start: leaf.start + boundary,
end: leaf.end,
});
}
next
}
Why this helps:
- one row buffer is reused for the whole tree
- leaves are tracked by ranges instead of fresh row vectors
- this matches the symmetric structure of oblivious trees and keeps the implementation regular
5. Deterministic randomization through shared seed mixing¶
Naïve approach¶
let mut rng = StdRng::seed_from_u64(user_seed);
let threshold = candidates[rng.gen_range(0..candidates.len())];
This makes results depend on the accidental order in which random draws happen across nodes, trees, or stages.
ForestFire approach¶
let tree_seed = mix_seed(base_seed, tree_index as u64);
let node_seed = node_seed(tree_seed, depth, salt, rows);
let threshold = choose_random_threshold(&candidates, feature_index, rows, node_seed)?;
Why this helps:
- fixed seed means repeatable training
- different trees, stages, and nodes still get different random contexts
- node-local choices depend on the row set, not on incidental temporary ordering
That is important for randomized trees, forests, and gradient boosting stages alike.
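The mixing step itself can be sketched with a generic splitmix64-style mixer (a stand-in for illustration; the library's actual `mix_seed` may differ):

```python
MASK = (1 << 64) - 1  # keep everything in 64-bit space

def mix_seed(base, salt):
    # splitmix64-style finalizer: small input changes flip many output bits
    z = (base + 0x9E3779B97F4A7C15 * (salt + 1)) & MASK
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK
    return z ^ (z >> 31)

# same inputs -> same seed; different tree index -> different seed
tree_seed_a = mix_seed(42, 0)
tree_seed_b = mix_seed(42, 1)
```

The property that matters is exactly the one listed above: fixed inputs give a repeatable seed, while distinct tree, stage, or node contexts receive well-separated random streams.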
6. Parallelize across trees where independence is real¶
Naïve approach¶
let trees = (0..n_trees)
.map(|tree_index| train_one_tree(sample_for(tree_index)))
.collect::<Result<Vec<_>, _>>()?;
This is correct, but it leaves independent bootstrap trees serialized.
ForestFire approach¶
let trees = (0..n_trees)
.into_par_iter()
.map(|tree_index| train_one_tree(sample_for(tree_index)))
.collect::<Result<Vec<_>, _>>()?;
Why this helps:
- forest trees are naturally independent training jobs
- the speedup lands at the algorithm level without changing single-tree semantics
- intra-tree work can stay conservative while inter-tree work scales out
ForestFire deliberately prefers this shape for forests because the independence is obvious and the coordination cost is low.
7. Stage-wise boosting on shared preprocessed data¶
Naïve approach¶
for stage in 0..n_trees {
let residual_table = rebuild_table_from_current_residuals(x, residuals);
let tree = train_tree(&residual_table)?;
update_predictions(&mut predictions, &tree);
}
This rebuilds too much state per stage.
ForestFire approach¶
for stage in 0..n_trees {
let (gradients, hessians) = compute_gradients_and_hessians(&predictions, targets);
let sampled_rows = gradient_focus_sample(base_rows, &gradients, &hessians, ...);
let sampled_table = SampledTable::new(train_set, sampled_rows.row_indices);
let tree = train_second_order_tree(&sampled_table, &sampled_rows.gradients, &sampled_rows.hessians)?;
update_predictions(&mut predictions, &tree, learning_rate);
}
Why this helps:
- the same preprocessed base table is reused
- only gradients, Hessians, and sampled row views change per stage
- stage computation stays focused on residual structure instead of rebuilding the whole world
8. Keep sparse binary inputs sparse¶
Naïve approach¶
fn sparse_to_dense(x: SparseInput) -> Vec<Vec<f64>> {
let mut dense = vec![vec![0.0; x.n_features]; x.n_rows];
for (feature, rows) in x.nonzero_rows {
for row in rows {
dense[row][feature] = 1.0;
}
}
dense
}
This pays the full dense memory cost before training even starts.
ForestFire approach¶
struct SparseBinaryColumn {
row_indices: Vec<usize>,
}
struct SparseTable {
columns: Vec<SparseBinaryColumn>,
n_rows: usize,
}
Why this helps:
- storage scales with positive entries, not full shape
- binary-feature semantics stay explicit
- the same sparse structure can feed both training and inference paths
9. Share core mechanics across learner families¶
Naïve approach¶
mod classifier_training {
fn build_histograms(...) { ... }
fn partition_rows(...) { ... }
}
mod regressor_training {
fn build_histograms(...) { ... }
fn partition_rows(...) { ... }
}
mod second_order_training {
fn build_histograms(...) { ... }
fn partition_rows(...) { ... }
}
This invites drift:
- one path gets faster
- another keeps the old bug or old allocation pattern
- randomization semantics stop lining up
ForestFire approach¶
mod tree::shared {
fn candidate_feature_indices(...) { ... }
fn partition_rows_for_binary_split(...) { ... }
fn choose_random_threshold(...) { ... }
fn mix_seed(...) { ... }
}
Why this helps:
- performance-sensitive fixes land once
- first-order and second-order trees stay behaviorally aligned
- the codebase can evolve without silently splitting into several incompatible training engines
Why these optimizations matter:
- binning turns repeated threshold search into a bounded discrete problem, which makes both scoring and runtime lowering much cheaper
- shared row-index buffers avoid per-node row-vector allocation, which is one of the main hidden costs in naive tree builders
- histogram subtraction matters because sibling statistics are not independent work; once one child is known, the other is often derivable from the parent
- shared tree internals matter because performance-sensitive fixes to histogram handling, partitioning, and stochastic split selection now land once and propagate across first-order and second-order learners together
- keeping sparse binary inputs sparse avoids paying a dense-memory penalty for data that is structurally sparse
Impact in practice:
- better cache behavior during split search
- less allocator pressure during recursive growth
- more predictable runtime layouts for optimized inference
- much better scaling in forests, where repeated tree training amplifies every avoidable allocation and recounting cost