Training¶
Algorithms¶
- `dt`: one tree
- `rf`: random forest
- `gbm`: gradient boosting
These are separate algorithms, not just cosmetic wrappers around the same tree builder:
- `dt` optimizes a single tree directly and is useful when interpretability matters most
- `rf` reduces variance by averaging many independently sampled trees
- `gbm` fits trees stage-by-stage to residual structure and is meant to trade interpretability for stronger predictive performance
Support matrix¶
- `algorithm="dt"`
    - regression: `cart`, `randomized`, `oblivious`
    - classification: `id3`, `c45`, `cart`, `randomized`, `oblivious`
- `algorithm="rf"`
    - regression: `cart`, `randomized`, `oblivious`
    - classification: `id3`, `c45`, `cart`, `randomized`, `oblivious`
- `algorithm="gbm"`
    - regression: `cart`, `randomized`, `oblivious`
    - classification: `cart`, `randomized`, `oblivious` with binary targets only
The support matrix is narrower for boosting because boosting needs more than a split criterion and a leaf value. It needs consistent gradient/Hessian handling, additive stage updates, and stable leaf semantics under repeated residual fitting. That is why the current boosting support is focused on the binary-split regression-style tree families that map cleanly onto second-order stage training.
Tree types¶
Classification:
- `id3`
- `c45`
- `cart`
- `randomized`
- `oblivious`
Regression:
- `cart`
- `randomized`
- `oblivious`
Why multiple tree types exist:
- `id3` and `c45` keep the classic multiway classifier style and are useful when you want a direct, explicit branch-per-bin structure
- `cart` is the general-purpose binary-tree baseline and the most natural fit for forests and boosting
- `randomized` introduces stochasticity into split search, which is useful both as a standalone learner and as the basis for extra-trees-style ensembles
- `oblivious` enforces one split per depth across the tree, which makes the final model more regular and easier to lower into compact runtime layouts
Split strategies¶
ForestFire separates the tree family from the split family.
The public split strategies are:
- `axis_aligned`
- `oblique`

`axis_aligned` is the default and means ordinary one-feature threshold splits.
oblique means a sparse pairwise linear split:
w1 * x_i + w2 * x_j <= t
Current support matrix:
- `axis_aligned`: all supported tree families
- `oblique`: `dt`, `rf`, and `gbm` when `tree_type` is `cart` or `randomized`
Not currently supported for:
- `id3`
- `c45`
- `oblivious`
Current implementation details:
- oblique nodes are currently pairwise, not arbitrary k-feature hyperplanes
- all candidate feature pairs available at a node are considered
- axis-aligned and oblique candidates compete inside the same canary-filtered ranking
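As a concrete sketch (illustrative Python, not the library's internals), a pairwise oblique node routes a row like this:

```python
def route_oblique(x_i, x_j, w1, w2, t):
    # sparse pairwise linear split: w1 * x_i + w2 * x_j <= t goes left
    return "left" if w1 * x_i + w2 * x_j <= t else "right"

# a diagonal boundary (x_0 - x_1 <= 0) separates these rows even though
# the left/right x_0 values interleave (1, 3 vs 2, 4), so no single
# axis-aligned threshold on either feature alone could do it
rows = [(1.0, 2.0), (3.0, 4.0), (2.0, 1.0), (4.0, 3.0)]
sides = [route_oblique(a, b, w1=1.0, w2=-1.0, t=0.0) for a, b in rows]
```

The interleaving is the point: oblique candidates buy expressiveness exactly where single-feature thresholds cannot draw the boundary.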
Builders¶
ForestFire also separates tree family from tree-construction strategy.
The public builders are:
- `greedy`
- `lookahead`
- `beam`
- `optimal`
At a high level:
- `greedy` ranks candidates by immediate node-local score only
- `lookahead` uses a bounded version of the same recursive subtree scorer as `optimal`, limited by `lookahead_depth`, `lookahead_top_k`, and `lookahead_weight`
- `beam` uses that same bounded scorer, but keeps up to `beam_width` continuations alive during future search before taking the strongest surviving continuation value
- `optimal` recursively evaluates all legal cuts until ordinary stopping rules or canary blocking end that branch
The builder controls how a node decides which split to take. It does not change:
- the split family itself
- the missing-value semantics of the learner
- the leaf payload semantics
Related parameters:
- `lookahead_depth`
- `lookahead_top_k`
- `lookahead_weight`
- `beam_width`

Those tuning knobs apply to `lookahead` and `beam`. `optimal` ignores them.
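The difference between greedy and lookahead-style ranking can be sketched with hypothetical candidate scores (the numbers and helper names below are illustrative, not the real scorer):

```python
# hypothetical per-candidate scores: (immediate_gain, best_future_gain)
candidates = {"f0": (5.0, 0.0), "f1": (3.0, 6.0)}

def greedy_pick(cands):
    # greedy ranks by immediate node-local score only
    return max(cands, key=lambda name: cands[name][0])

def lookahead_pick(cands, lookahead_weight=0.5):
    # bounded future score blended in via lookahead_weight
    return max(cands, key=lambda name: cands[name][0]
               + lookahead_weight * cands[name][1])
```

With these numbers, `greedy_pick` takes `"f0"` on immediate gain alone, while `lookahead_pick` prefers `"f1"` because its discounted future gain outweighs the weaker immediate score.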
For the detailed algorithmic behavior, see:
Task detection¶
With task="auto":
- integer, boolean, and string targets become classification
- float targets become regression
This keeps the common path ergonomic without hiding the important distinction that task choice changes:
- the split objective
- the leaf payload
- the valid tree families
- the prediction API surface, especially `predict_proba(...)`
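A minimal sketch of the detection rule (illustrative Python; `infer_task` is not part of the public API):

```python
def infer_task(y):
    """Sketch of the task="auto" rule for plain Python targets."""
    if all(isinstance(v, (bool, int, str)) for v in y):
        return "classification"  # integer, boolean, and string targets
    if all(isinstance(v, float) for v in y):
        return "regression"      # float targets
    raise TypeError("mixed target types are ambiguous in this sketch")
```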
Missing values during training¶
ForestFire treats missing values as first-class training inputs rather than as rows that must be dropped beforehand.
Accepted missing markers include:

- Python `None`
- floating-point `NaN`
- pandas/NumPy `NaN`
- polars `null` values
The training contract is:
- every feature gets a dedicated missing bin
- observed split search ignores that missing bin
- missing rows are then routed according to the configured `missing_value_strategy`
The public strategies are:
- `"heuristic"`: choose the best split from observed values first, then decide whether missing rows should go left or right for that split
- `"optimal"`: for every candidate split, evaluate missing-left and missing-right, then choose the best full combination
- per-column dictionary: assign `"heuristic"` or `"optimal"` feature by feature, with unspecified features defaulting to `"heuristic"`
Python examples:
train(X, y, missing_value_strategy="heuristic")
train(X, y, missing_value_strategy="optimal")
train(X, y, missing_value_strategy={"col_1": "optimal", "col_2": "heuristic"})
train(X, y, missing_value_strategy={"f0": "optimal", "f1": "heuristic"})
The feature-name forms are aliases for semantic column indices:
- `"col_1"` and `"f0"` both mean feature index `0`
- `"col_2"` and `"f1"` both mean feature index `1`
Why both strategies exist:
- `"heuristic"` is the practical default and keeps training cost closer to ordinary split search
- `"optimal"` can be much slower because it expands the search to all candidate split and missing-routing combinations
That design matters because it separates two questions that are often muddled together:
- what is the best split among observed values?
- given that split, where should the missing rows go?
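The two-phase versus joint search can be contrasted in a small sketch with hypothetical gain numbers (not the library's internals):

```python
# hypothetical scored candidates: observed-split gain plus the extra
# gain each missing-routing direction would add for that split
splits = {
    "f0": {"observed_gain": 5.0, "missing_gain": {"left": 0.0, "right": 1.0}},
    "f1": {"observed_gain": 4.0, "missing_gain": {"left": 3.0, "right": 0.0}},
}

def best_split_heuristic(splits):
    # phase 1: best split on observed values only
    feat = max(splits, key=lambda f: splits[f]["observed_gain"])
    # phase 2: then decide missing routing for that split alone
    side = max(("left", "right"), key=lambda d: splits[feat]["missing_gain"][d])
    return feat, side

def best_split_optimal(splits):
    # evaluate every (split, missing-direction) combination jointly
    combos = [(f, d, s["observed_gain"] + s["missing_gain"][d])
              for f, s in splits.items() for d in ("left", "right")]
    best = max(combos, key=lambda c: c[2])
    return best[0], best[1]
```

With these numbers the two strategies genuinely disagree: the heuristic commits to `f0` before considering missing routing, while the joint search finds that `f1` with missing-left scores higher overall.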
When a node never saw missing values for its split feature during training, the model does not invent a learned missing branch. A later missing value at prediction time falls back to the node prediction instead:
- majority class for classification
- node mean for regression
Current implementation note:
- the strategy toggle is implemented for the standard first-order tree builders
- the second-order boosting path uses the same learned missing-routing semantics, but it does not expose a separate heuristic-vs-optimal toggle
Categorical features¶
Categorical handling uses the same public train(...) interface as numeric
training. There is no separate categorical training API.
The current strategies are:
- `dummy`
- `target`
- `fisher`
At a high level:
- `dummy` expands nominal categories into indicator-style derived columns
- `target` replaces categories with smoothed target-derived statistics
- `fisher` orders categories by target-derived statistics and then lets the tree learner split over that learned ordering
The implementation is split intentionally:
- `dummy` and `target` are table-layer transforms
- `fisher` is implemented in core because it is tied to category ordering for threshold-based split search
From the trainer’s point of view, categorical handling happens before tree growth and produces the numeric/binary representation consumed by the existing split machinery. That means:
- trees, forests, and GBM all use the same categorical entry path
- canaries compete against the transformed features rather than raw category labels
- histogram-based split search works over the transformed representation
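The idea behind `fisher` can be sketched as an illustrative Python reduction (not the core implementation, which also handles smoothing and missing categories):

```python
from collections import defaultdict

def fisher_order(categories, targets):
    """Map categories to ranks in target-mean order.

    After this mapping, an ordinary threshold split over ranks can
    express any contiguous grouping of categories in target order.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1
    ordered = sorted(sums, key=lambda cat: sums[cat] / counts[cat])
    rank = {cat: i for i, cat in enumerate(ordered)}
    return [rank[cat] for cat in categories]
```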
Example:
train(
X,
y,
algorithm="gbm",
task="classification",
tree_type="cart",
categorical_strategy="fisher",
)
Categorical models now preserve that transform contract in:
- semantic serialization
- IR export
- compiled optimized artifacts
So restored categorical models still accept raw categorical inputs rather than requiring callers to reproduce the transform manually.
Sample weights¶
ForestFire accepts per-row training weights via `sample_weight`.

- `None` (default): all rows contribute equally
- 1-D numeric array of length `n_rows`: rows are weighted by those values
For regression, split scoring uses weighted MSE: Σ w_i (y_i - ȳ)², where ȳ is the weighted node mean.
For gradient boosting, weights scale the per-row gradient and Hessian so high-weight rows drive larger update steps.
Weights are propagated through bootstrap sampling in rf and through gradient-focus row sampling in gbm, so ensemble methods respect weights correctly.
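The weighted objective can be sketched directly (illustrative Python, not the library's scorer):

```python
def weighted_mse(y, w):
    # weighted node objective: sum of w_i * (y_i - weighted mean)^2
    mean = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, y))
```

One useful sanity property of this form: giving a row weight 2 scores the same as duplicating that row with weight 1.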
Multi-target regression¶
When y is a 2-D array of shape (n_rows, n_targets) with n_targets > 1
and task="regression", a single tree is trained to predict all targets
jointly.
- splits are chosen by maximising the sum of MSE gain across all targets, evaluated at a shared threshold per feature so score and applied split are always consistent
- each leaf stores one predicted value per target
- `predict(...)` returns a 2-D array of shape `(n_rows, n_targets)`
- compatible with `sample_weight`
Multi-target training uses `algorithm="dt"`. Multi-target support for `rf` and `gbm` is not currently available.
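The shared-threshold gain rule can be sketched like this (illustrative Python; `multi_target_gain` is a hypothetical helper, not the public API):

```python
def sse(values):
    # sum of squared errors around the mean
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def multi_target_gain(y_left, y_right, y_parent):
    # each row is a tuple with one value per target; the gain for one
    # shared threshold is the sum of per-target SSE reductions
    n_targets = len(y_parent[0])
    gain = 0.0
    for t in range(n_targets):
        gain += (sse([row[t] for row in y_parent])
                 - sse([row[t] for row in y_left])
                 - sse([row[t] for row in y_right]))
    return gain
```

Because every target is evaluated at the same threshold, the gain that ranked the split is exactly the gain realized when the split is applied.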
Stopping and control parameters¶
- `max_depth`
- `min_samples_split`
- `min_samples_leaf`
- `max_features`
- `seed`
These parameters materially affect learning across the supported tree families, including forest constituent trees.
The rationale behind these controls is straightforward:
- `max_depth` limits structural complexity directly and is the cleanest way to cap traversal cost
- `min_samples_split` prevents very small unstable internal nodes
- `min_samples_leaf` prevents highly confident leaves built from tiny support
- `max_features` controls correlation and search width, which matters much more in ensembles than in a single tree
- `seed` is there because the library deliberately uses randomness in several places: bootstrapping, feature subsampling, randomized splits, and boosting row sampling
The intended semantics of `seed` are:
- fixed seed -> repeatable training for the same dataset and configuration
- different seed -> legitimately different stochastic choices
That randomness is not left ad hoc inside each learner. ForestFire now uses a shared seed-mixing path for:
- forest tree seeds
- boosting stage seeds
- node-local feature-subset sampling
- randomized threshold selection
One practical detail matters here: node-local randomization is derived from the rows owned by the node, but not from the incidental order of the mutable row-index buffer. That makes the stochastic behavior more stable under internal partitioning details while remaining sensitive to the actual training data seen at each node.
Canaries¶
ForestFire uses canary features to stop growth during training.
For the full rationale and algorithm-specific behavior, see Canary Strategy.
Behavior:
- standard trees stop at the current node if no acceptable real split survives canary competition
- oblivious trees stop further depth growth if no acceptable real level-split survives canary competition
- gradient boosting stops adding stages if no acceptable real root split survives canary competition
- random forests ignore canaries during tree training
By default, that competition is strict:
- `filter=None` behaves like `filter=1`
- the top-ranked candidate must already be a real feature
You can soften that rule with `filter=...`:

- `filter=3` means the chosen real split must appear within the top 3 scored candidates
- `filter=0.95` means the chosen real split must appear within the top 5% of scored candidates
In both cases, candidates are still scored and sorted in the usual way first. The difference is only which ranked window is allowed to contain the chosen real feature.
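The window check can be sketched as follows. Only the intent (top-N versus top-fraction windows over the sorted candidates) comes from the description above; the exact rounding of the fractional window is an assumption of this sketch:

```python
def real_split_allowed(ranked_is_canary, filter=None):
    """ranked_is_canary: best-first flags, True meaning a canary candidate.

    filter=None behaves like filter=1; an int keeps a top-N window, a
    float like 0.95 keeps the top (1 - filter) fraction of candidates.
    """
    if filter is None:
        filter = 1
    if isinstance(filter, int):
        window = filter
    else:
        window = max(1, int(len(ranked_is_canary) * (1 - filter)))
    # the node may split only if some real feature sits inside the window
    return any(not is_canary for is_canary in ranked_is_canary[:window])
```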
Why canaries exist:
- standard tree learners can always continue to fit noise if the stopping rule only looks at local impurity decrease
- a canary feature is a shuffled copy of real signal space, so if the learner prefers it, the current split quality is no better than structured noise
- this turns stopping into a training-time competition between real and noise-like features instead of a later pruning pass
Why random forests ignore them:
- forests already reduce overfitting primarily through bootstrapping and feature subsampling
- reintroducing canaries inside each tree would often stop growth for the wrong reason, because each constituent tree already sees a deliberately restricted view of the data
- in boosting, the opposite tradeoff is useful: canaries remain active because stage-wise residual fitting is much more willing to chase noise late in training
Random forests¶
- bootstrap sampling per tree
- feature subsampling per node
- optional OOB score computation
The random-forest implementation is intentionally close to the classical idea:
- each tree trains on a bootstrap sample
- each node considers only a sampled feature subset
- predictions are aggregated by averaging leaf outputs or probabilities
The important design choice is that preprocessing is still shared. Trees do not each rebucket the raw data. They all operate over the same preprocessed representation, which keeps training semantics aligned and avoids duplicated preprocessing work.
Gradient boosting¶
- second-order trees trained from gradients and Hessians
- `learning_rate`
- optional stage bootstrapping
- LightGBM-style gradient-focused row sampling via `top_gradient_fraction` and `other_gradient_fraction`
The boosting implementation is LightGBM-like in spirit, but not a clone.
It keeps the ideas that matter most:
- second-order tree fitting from gradients and Hessians
- shrinkage via `learning_rate`
- optional row sampling
- focusing stage fitting on high-gradient rows
But it keeps a ForestFire-specific stopping rule:
- canaries still participate in split selection
- if no real root split survives inside the allowed canary `filter` window, that stage is discarded and boosting stops
- the first boosting stage can optionally retry with `boosting_first_stage_retry_filter`
- that retry filter defaults to `TopN(1)`, so it preserves the strict stop unless you widen it explicitly
That choice reflects the project’s general design preference: use explicit training-time noise competition as a stopping signal instead of bolting on a separate pruning or early-stopping layer later.
How second-order tree fitting works¶
The boosting stage loop is serial, but the tree builder inside one stage is now organized so it can become more parallel internally.
That distinction matters:
- stage `k + 1` depends on the predictions produced after stage `k`
- the nodes inside the single tree fitted at stage `k` do not all depend on one another in the same way
ForestFire therefore treats "fit one boosting stage" and "grow one tree inside that stage" as separate scheduling problems.
At a high level, one boosting stage does this:
- compute per-row gradients and Hessians from the current ensemble prediction
- fit one second-order tree against those gradient/Hessian pairs
- scale that tree by `learning_rate`
- add it to the ensemble
- repeat until the stopping rule rejects the next stage
The second-order tree itself uses Newton-style leaf values. For any node, the leaf prediction is derived from the node-local gradient and Hessian totals:
leaf_value = -G / (H + lambda)
where:
- `G` is the sum of gradients in the node
- `H` is the sum of Hessians in the node
- `lambda` is the L2 regularization term
The split score asks whether dividing the current rows into left and right children increases the regularized objective strength enough to justify growing the tree.
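A worked sketch of the leaf value, together with the second-order gain form it implies. The gain expression below is the standard XGBoost-style formula, assumed here for illustration; only the leaf-value formula comes from the text above:

```python
def objective_strength(g_sum, h_sum, lam=1.0):
    # standard second-order objective strength: G^2 / (H + lambda)
    return g_sum * g_sum / (h_sum + lam)

def leaf_value(g_sum, h_sum, lam=1.0):
    # Newton-style leaf value from node gradient/Hessian totals
    return -g_sum / (h_sum + lam)

def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    # children strength minus parent strength; positive values mean the
    # split strengthens the regularized objective
    return (objective_strength(g_left, h_left, lam)
            + objective_strength(g_right, h_right, lam)
            - objective_strength(g_left + g_right, h_left + h_right, lam))
```

For example, a node with `G = -4` and `H = 3` under `lambda = 1` predicts `-(-4) / (3 + 1) = 1.0`.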
The active-node frontier¶
CART and randomized trees — regression, classification, and gradient boosting alike — now grow with an active-node frontier instead of depth-first recursion.
The frontier is simply the set of nodes at the current depth that are still eligible to split.
That means tree growth now looks like this:
- start with the root in the frontier
- evaluate every node in that frontier
- decide which nodes actually split
- partition rows for those winning splits
- create all children
- use those children as the next frontier
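The loop above can be sketched as follows (illustrative Python; `choose_split` and `partition` stand in for the real evaluation and row-buffer machinery):

```python
def grow_level_wise(root_rows, choose_split, partition, max_depth):
    """Frontier-based growth: evaluate a whole depth, then partition,
    then form the next frontier."""
    frontier = [root_rows]
    for _ in range(max_depth):
        # phase 1: evaluate every frontier node (no row mutation yet)
        decisions = [choose_split(node) for node in frontier]
        # phases 2-3: partition the winners, collect their children
        next_frontier = []
        for node, split in zip(frontier, decisions):
            if split is None:
                continue  # node becomes a leaf
            left, right = partition(node, split)
            next_frontier.extend([left, right])
        if not next_frontier:
            break
        frontier = next_frontier
    return frontier

# toy drivers: split any node with >= 2 rows at its midpoint
choose = lambda rows: len(rows) // 2 if len(rows) >= 2 else None
halve = lambda rows, k: (rows[:k], rows[k:])
leaves = grow_level_wise(list(range(4)), choose, halve, max_depth=2)
```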
This is different from the older depth-first pattern:
- evaluate one node
- partition its rows immediately
- recurse into the left child
- recurse into the right child
- only later return to same-depth siblings
The frontier-based layout is important because it separates node evaluation from row-buffer mutation.
In the standard binary tree path, ForestFire still uses one shared mutable row-index buffer rather than copying rows into separate node-local allocations. Each active node owns a contiguous slice of that buffer. During evaluation, the builder reads that slice, computes node statistics, constructs or reuses histograms, and scores candidate features. It does not mutate the slice yet.
Only after the whole current frontier has been evaluated does the builder begin partitioning rows for the nodes that actually split.
That separation is the architectural point of the frontier:
- evaluation is "read the current node state and score split candidates"
- partitioning is "rewrite the shared row-index slice so left/right children occupy contiguous ranges"
- child creation is "record those new ranges as the next active frontier"
The standard CART/randomized path now uses that boundary in three explicit phases for each depth:
- evaluate the current frontier in parallel
- partition row ranges for the nodes that actually split in parallel across disjoint row-buffer slices
- build child histograms for the next frontier in parallel
That means the frontier is no longer only a structural preparation for future parallelism. It is now the execution model used by all standard CART/randomized builders: regression, classification, and gradient boosting.
What is evaluated for each active node¶
For every active node, the builder computes:
- sample count
- gradient sum
- Hessian sum
- the node leaf prediction
- the node objective strength before splitting
- feature histograms for the node rows
- the best surviving split candidate after canary filtering
The stopping checks happen at this evaluation stage:
- max depth
- `min_samples_split`
- non-positive or too-small Hessian mass
- no valid feature split after `min_samples_leaf` and `min_sum_hessian_in_leaf`
- canary competition blocking the node at the root
If a node does not produce a valid split, it becomes a leaf and never enters the next frontier.
If it does split, the builder retains enough information to mutate the row buffer afterward without rescoring the node.
When training parallelism is enabled, that evaluation phase runs across the whole active frontier at once. Each node still scores its own candidate features using the same histogram-based split logic as before, but same-depth nodes can now do that work concurrently.
The later mutation phase is now parallel too. Pending same-depth splits are sorted by their owned row ranges, then partitioned through a recursive divide-and-conquer pass over disjoint slices of the shared row-index buffer. That keeps the in-place row-buffer model intact while allowing non-overlapping partitions to run concurrently.
Oblique splitting now participates in that same frontier flow for standard `cart` and `randomized` trees across all learner families.
In that mode:
- axis-aligned and oblique candidates are ranked together at each active node
- oblique candidates are still sparse pairwise linear splits
- all candidate feature pairs available at the node are considered
- per-feature missing routing is learned for the two participating features and replayed at inference time
Histograms and child reuse¶
The second-order tree path is histogram-based.
For each feature and node, the histogram stores:
- row count
- gradient sum
- Hessian sum
Those histograms are what let the builder score candidate thresholds without rescanning raw rows for every possible split.
When a node is partitioned, the implementation tries to avoid rebuilding both children from scratch:
- it builds histograms for the smaller child directly from that child’s rows
- it derives the sibling histograms by subtracting the smaller child from the parent histograms
That matters because histogram construction is one of the dominant training costs. Reusing the parent histogram avoids paying that full cost twice per split.
That reuse now happens inside the frontier pipeline as well. After partitioning has produced the child row ranges for every winning split at the current depth, the builder constructs the next frontier’s histograms in parallel across those splitting nodes.
Why the frontier matters for parallelism¶
The frontier provides both the execution boundary and the actual parallel work for all standard CART/randomized trees.
Without a frontier, same-depth siblings are entangled with mutation order:
- node `A` is evaluated
- node `A` partitions the shared row buffer
- node `A` recurses into its children
- only later does node `B` get evaluated
That ordering makes same-depth batching awkward because one branch has already started rewriting shared training state before its siblings have even been scored.
With a frontier, every node at a depth is evaluated against the same stable view of the current row ownership. That makes several future steps much cleaner:
- scoring active nodes in parallel
- feature-parallel split search inside each node evaluation
- partitioning disjoint same-depth row ranges in parallel once split choices are known
- building child histograms across splitting nodes in parallel after partitioning
- postponing row partitioning until after split selection is complete
For gradient boosting, the stage loop still remains serial. The frontier only changes the work inside one stage’s tree fit.
That is the intended balance:
- keep stage semantics simple and correct
- make intra-tree work increasingly parallel over time
Relationship to oblivious trees¶
Oblivious second-order trees were already level-wise by construction because every node at a given depth shares the same feature/threshold pair.
The active-node frontier brings all standard CART/randomized paths closer to that same batching style, but without changing the standard-tree semantics:
- standard trees still choose splits independently per node
- oblivious trees still choose one shared split per depth
What they share is the key scheduling idea that depth-level work can be evaluated as a batch before the next level is created.
Training optimizations¶
ForestFire training is optimized around a compact binned core and shared row-index buffers rather than row copies.
- numeric features are pre-binned into compact integer ranks, capped at `512` bins
- each feature reserves one extra missing bin alongside the observed bins
- `bins="auto"` chooses the highest populated power-of-two count per feature while keeping at least two rows in every realized bin
- `histogram_bins=...` can override the numeric resolution used during split search without requiring callers to rebuild the source `Table`
- long-running training and prediction release the Python GIL before entering the Rust hot path
- CART and randomized trees use histogram-based numeric split search
- standard binary trees partition row indices in place
- ID3 and C4.5 use the same in-place row-buffer approach for multiway branches
- oblivious trees train over shared row buffers plus per-leaf ranges
- CART/randomized classification and mean-regression builders reuse parent histograms and derive sibling histograms by subtraction
- oblivious split scoring reuses cached per-leaf counts or `sum`/`sum_sq`
- random forests parallelize across trees while limiting intra-tree parallelism
- all standard CART/randomized trees (regression, classification, and gradient boosting) now grow through a level-wise active-node frontier, with parallel frontier evaluation, feature-parallel split scoring, parallel row partitioning over disjoint slices, and parallel child-histogram construction separated by explicit frontier phases
- binary sparse inputs stay sparse through training and inference
- classifier, regressor, and second-order tree builders now share the same core histogram/partitioning/randomization helpers instead of carrying separate implementations of the same mechanics
Reading the snippets below¶
As in the runtime page, the snippets here are representative shapes rather than exact copies of every production function.
They are meant to show:
- the straightforward naïve way to structure the work
- the shape ForestFire actually optimizes for
- why the optimized shape is cheaper in memory traffic, allocation, or repeated computation
The real implementations live across crates/data/src and crates/core/src/tree, but these snippets are the shortest path to understanding the architecture.
1. Binning once instead of rescanning raw floats at every node¶
Naïve approach¶
fn best_threshold_naive(values: &[f64], rows: &[usize]) -> Option<f64> {
let mut best = None;
for &candidate_row in rows {
let threshold = values[candidate_row];
let (left, right): (Vec<_>, Vec<_>) = rows
.iter()
.copied()
.partition(|&row| values[row] <= threshold);
let score = score_split_from_raw_values(values, &left, &right);
best = pick_better(best, (threshold, score));
}
best.map(|(threshold, _)| threshold)
}
This rescans raw floating-point values and repartitions rows for every candidate threshold.
ForestFire approach¶
fn best_threshold_binned(
binned_values: &[u16],
histogram: &[HistogramBin],
) -> Option<u16> {
let mut best = None;
let mut left = HistogramBin::default();
let total = histogram_total(histogram);
for (bin, bucket) in histogram.iter().enumerate() {
left = left + bucket;
let right = total - left;
let score = score_split_from_histograms(&left, &right);
best = pick_better(best, (bin as u16, score));
}
best.map(|(bin, _)| bin)
}
Why this helps:
- candidate space is bounded by the realized bins
- split scoring works from aggregated counts or sums rather than rescanning rows
- the same binned representation is reused by every learner family and later by optimized inference
2. In-place row partitioning instead of copying child row vectors¶
Naïve approach¶
fn split_rows_naive(rows: &[usize], feature: usize, threshold: u16, table: &dyn TableAccess) -> (Vec<usize>, Vec<usize>) {
rows.iter()
.copied()
.partition(|row| table.binned_value(feature, *row) <= threshold)
}
This allocates new child vectors at every split.
ForestFire approach¶
fn split_rows_in_place(
rows: &mut [usize],
feature: usize,
threshold: u16,
table: &dyn TableAccess,
) -> usize {
let mut boundary = 0usize;
for idx in 0..rows.len() {
if table.binned_value(feature, rows[idx]) <= threshold {
rows.swap(boundary, idx);
boundary += 1;
}
}
boundary
}
Now one mutable row-index buffer is reused throughout growth:
- `rows[..boundary]` is the left child
- `rows[boundary..]` is the right child
Why this helps:
- far fewer allocations
- better cache locality on the row-index buffer
- easier reuse of the same data structure across CART, randomized, and second-order trees
3. Histogram subtraction instead of rebuilding sibling statistics¶
Naïve approach¶
fn child_histograms_naive(
left_rows: &[usize],
right_rows: &[usize],
table: &dyn TableAccess,
) -> (Vec<HistogramBin>, Vec<HistogramBin>) {
(
build_histograms_from_rows(left_rows, table),
build_histograms_from_rows(right_rows, table),
)
}
This does full work for both children even though the parent histogram is already known.
ForestFire approach¶
fn child_histograms_subtractive(
parent: &[HistogramBin],
left: &[HistogramBin],
) -> (Vec<HistogramBin>, Vec<HistogramBin>) {
let right = parent
.iter()
.zip(left.iter())
.map(|(parent_bin, left_bin)| parent_bin.subtract(left_bin))
.collect::<Vec<_>>();
(left.to_vec(), right)
}
Why this helps:
- one child histogram can be built directly
- the sibling becomes derived work instead of independent work
- the savings compound at every internal node in binary tree growth
4. Shared row buffers for oblivious training¶
Naïve approach¶
fn grow_oblivious_naive(leaves: &[Vec<usize>], split: Split, table: &dyn TableAccess) -> Vec<Vec<usize>> {
let mut next = Vec::new();
for leaf_rows in leaves {
let (left, right) = leaf_rows
.iter()
.copied()
.partition(|row| table.binned_value(split.feature_index, *row) <= split.threshold_bin);
next.push(left);
next.push(right);
}
next
}
This keeps allocating fresh vectors per leaf at every depth.
ForestFire approach¶
struct LeafRange {
start: usize,
end: usize,
}
fn grow_oblivious_in_place(
rows: &mut [usize],
leaves: &[LeafRange],
split: Split,
table: &dyn TableAccess,
) -> Vec<LeafRange> {
let mut next = Vec::with_capacity(leaves.len() * 2);
for leaf in leaves {
let boundary = split_rows_in_place(
&mut rows[leaf.start..leaf.end],
split.feature_index,
split.threshold_bin,
table,
);
next.push(LeafRange {
start: leaf.start,
end: leaf.start + boundary,
});
next.push(LeafRange {
start: leaf.start + boundary,
end: leaf.end,
});
}
next
}
Why this helps:
- one row buffer is reused for the whole tree
- leaves are tracked by ranges instead of fresh row vectors
- this matches the symmetric structure of oblivious trees and keeps the implementation regular
5. Deterministic randomization through shared seed mixing¶
Naïve approach¶
let mut rng = StdRng::seed_from_u64(user_seed);
let threshold = candidates[rng.gen_range(0..candidates.len())];
This makes results depend on the accidental order in which random draws happen across nodes, trees, or stages.
ForestFire approach¶
let tree_seed = mix_seed(base_seed, tree_index as u64);
let node_seed = node_seed(tree_seed, depth, salt, rows);
let threshold = choose_random_threshold(&candidates, feature_index, rows, node_seed)?;
Why this helps:
- fixed seed means repeatable training
- different trees, stages, and nodes still get different random contexts
- node-local choices depend on the row set, not on incidental temporary ordering
That is important for randomized trees, forests, and gradient boosting stages alike.
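The mixing step itself can be sketched with a generic splitmix64-style mixer (a stand-in for illustration; the library's actual `mix_seed` may differ):

```python
MASK = (1 << 64) - 1  # keep everything in 64-bit space

def mix_seed(base, salt):
    # splitmix64-style finalizer: small input changes flip many output bits
    z = (base + 0x9E3779B97F4A7C15 * (salt + 1)) & MASK
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK
    return z ^ (z >> 31)

# same inputs -> same seed; different tree index -> different seed
tree_seed_a = mix_seed(42, 0)
tree_seed_b = mix_seed(42, 1)
```

The property that matters is exactly the one listed above: fixed inputs give a repeatable seed, while distinct tree, stage, or node contexts receive well-separated random streams.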
6. Parallelize across trees where independence is real¶
Naïve approach¶
let trees = (0..n_trees)
.map(|tree_index| train_one_tree(sample_for(tree_index)))
.collect::<Result<Vec<_>, _>>()?;
This is correct, but it leaves independent bootstrap trees serialized.
ForestFire approach¶
let trees = (0..n_trees)
.into_par_iter()
.map(|tree_index| train_one_tree(sample_for(tree_index)))
.collect::<Result<Vec<_>, _>>()?;
Why this helps:
- forest trees are naturally independent training jobs
- the speedup lands at the algorithm level without changing single-tree semantics
- intra-tree work can stay conservative while inter-tree work scales out
ForestFire deliberately prefers this shape for forests because the independence is obvious and the coordination cost is low.
7. Stage-wise boosting on shared preprocessed data¶
Naïve approach¶
for stage in 0..n_trees {
let residual_table = rebuild_table_from_current_residuals(x, residuals);
let tree = train_tree(&residual_table)?;
update_predictions(&mut predictions, &tree);
}
This rebuilds too much state per stage.
ForestFire approach¶
for stage in 0..n_trees {
let (gradients, hessians) = compute_gradients_and_hessians(&predictions, targets);
let sampled_rows = gradient_focus_sample(base_rows, &gradients, &hessians, ...);
let sampled_table = SampledTable::new(train_set, sampled_rows.row_indices);
let tree = train_second_order_tree(&sampled_table, &sampled_rows.gradients, &sampled_rows.hessians)?;
update_predictions(&mut predictions, &tree, learning_rate);
}
Why this helps:
- the same preprocessed base table is reused
- only gradients, Hessians, and sampled row views change per stage
- stage computation stays focused on residual structure instead of rebuilding the whole world
8. Keep sparse binary inputs sparse¶
Naïve approach¶
fn sparse_to_dense(x: SparseInput) -> Vec<Vec<f64>> {
let mut dense = vec![vec![0.0; x.n_features]; x.n_rows];
for (feature, rows) in x.nonzero_rows {
for row in rows {
dense[row][feature] = 1.0;
}
}
dense
}
This pays the full dense memory cost before training even starts.
ForestFire approach¶
struct SparseBinaryColumn {
row_indices: Vec<usize>,
}
struct SparseTable {
columns: Vec<SparseBinaryColumn>,
n_rows: usize,
}
Why this helps:
- storage scales with positive entries, not full shape
- binary-feature semantics stay explicit
- the same sparse structure can feed both training and inference paths
9. Share core mechanics across learner families¶
Naïve approach¶
mod classifier_training {
fn build_histograms(...) { ... }
fn partition_rows(...) { ... }
}
mod regressor_training {
fn build_histograms(...) { ... }
fn partition_rows(...) { ... }
}
mod second_order_training {
fn build_histograms(...) { ... }
fn partition_rows(...) { ... }
}
This invites drift:
- one path gets faster
- another keeps the old bug or old allocation pattern
- randomization semantics stop lining up
ForestFire approach¶
mod tree::shared {
fn candidate_feature_indices(...) { ... }
fn partition_rows_for_binary_split(...) { ... }
fn choose_random_threshold(...) { ... }
fn mix_seed(...) { ... }
}
Why this helps:
- performance-sensitive fixes land once
- first-order and second-order trees stay behaviorally aligned
- the codebase can evolve without silently splitting into several incompatible training engines
Why these optimizations matter:
- binning turns repeated threshold search into a bounded discrete problem, which makes both scoring and runtime lowering much cheaper
- shared row-index buffers avoid per-node row-vector allocation, which is one of the main hidden costs in naive tree builders
- histogram subtraction matters because sibling statistics are not independent work; once one child is known, the other is often derivable from the parent
- shared tree internals matter because performance-sensitive fixes to histogram handling, partitioning, and stochastic split selection now land once and propagate across first-order and second-order learners together
- keeping sparse binary inputs sparse avoids paying a dense-memory penalty for data that is structurally sparse
Impact in practice:
- better cache behavior during split search
- less allocator pressure during recursive growth
- more predictable runtime layouts for optimized inference
- much better scaling in forests, where repeated tree training amplifies every avoidable allocation and recounting cost