
Canary Strategy

ForestFire uses canary features as a training-time noise baseline during tree growth.

This is one of the project’s strongest design opinions, and it affects how you should think about stopping, regularization, and ensemble behavior.

What a canary is

A canary feature is a shuffled copy of an already-preprocessed real feature.

That means:

  • it looks statistically plausible
  • it has the same value domain as a real feature
  • it carries no real relationship to the target

The point is not to create obviously fake noise. The point is to create a realistic competitor that answers a useful question:

is the best available split better than structured noise?

If the answer is no, the learner should usually stop.
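To make that concrete, here is a minimal self-contained sketch (illustrative only, not ForestFire internals; the variance-reduction gain and the synthetic data are invented for the example). A canary is just a permutation of a real column, and the question is whether the real column’s best split gain beats the canary’s:

```python
# Illustrative sketch: build a canary by shuffling a preprocessed feature,
# then compare the best split gain on the real feature vs. the canary.
import numpy as np

rng = np.random.default_rng(0)

def make_canary(column: np.ndarray) -> np.ndarray:
    """Same value domain and marginal distribution, no link to the target."""
    return rng.permutation(column)

def best_split_gain(x: np.ndarray, y: np.ndarray) -> float:
    """Best variance reduction over all threshold splits of one feature."""
    parent = y.var() * len(y)
    best = 0.0
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        best = max(best, parent - (left.var() * len(left) + right.var() * len(right)))
    return best

x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)            # x carries real signal
print(best_split_gain(x, y))                  # real feature
print(best_split_gain(make_canary(x), y))     # structured noise baseline
```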

Why ForestFire uses canaries

Classic tree growth has a familiar problem:

  • if you keep searching for local impurity improvement, you can usually keep finding small “improvements” in noise
  • those improvements may be mathematically real on the training sample while still being statistically meaningless

A canary turns that into a direct competition.

Instead of asking only:

  • “does this split improve the objective?”

ForestFire also effectively asks:

  • “does this split beat a shuffled feature drawn from the same feature space?”

That changes the role of stopping.

Stopping is no longer just:

  • max depth
  • minimum samples
  • minimum gain
  • post-hoc pruning

It becomes part of split selection itself.

What canaries replace

Canaries partially replace several traditional “don’t let the tree grow too far” mechanisms.

In practice they take over part of the job usually handled by:

  • pruning
  • overly aggressive depth caps
  • large minimum-leaf settings used purely as a noise guard
  • minimum-gain heuristics used only to prevent tiny accidental improvements

Why this is useful:

  • the stopping signal is data-dependent
  • the threshold is contextual rather than globally fixed
  • the learner compares real features to realistic noise, not to an arbitrary constant

That makes the stop decision more adaptive than saying:

  • “never go deeper than d”
  • “never split below n rows”
  • “never accept gain below g”

Those controls are still useful, but they are not trying to answer exactly the same question.

What canaries do not replace

Canaries do not make the standard structural hyperparameters obsolete.

They do not replace:

  • max_depth
  • min_samples_split
  • min_samples_leaf
  • max_features
  • seed

Why not:

max_depth

max_depth caps complexity directly.

Even if every split is better than a canary, you may still want:

  • bounded latency
  • bounded model size
  • bounded explanation depth

Canaries are about “is this split better than noise?”

max_depth is about “how large am I willing to let this object become?”

Those are different questions.

min_samples_split

This prevents the learner from even considering splits in nodes that are too small to support stable estimates.

Canaries help judge signal quality, but they do not repair the fact that tiny nodes are high-variance objects.

min_samples_leaf

This constrains how concentrated confidence can become.

Even if a split beats a canary, a leaf with only a handful of rows may still be operationally undesirable for:

  • calibration
  • interpretability
  • robustness

max_features

This changes the search space itself.

Canaries decide whether growth should continue. max_features decides how wide the competition set is at each node or level.

That matters especially in ensembles.

seed

Canaries are shuffled copies. Randomized trees, forests, and boosting also use randomness elsewhere.

If you care about reproducibility, you still need the seed.
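A purely hypothetical usage sketch may help keep the two roles apart. The import path and class name below are assumptions; only the hyperparameter names come from this page:

```python
from forestfire import DecisionTree  # hypothetical import path and class name

tree = DecisionTree(
    max_depth=8,           # bounds size/latency even when splits keep beating canaries
    min_samples_split=20,  # tiny nodes stay high-variance regardless of canaries
    min_samples_leaf=5,    # limits how concentrated confidence can become
    max_features=0.8,      # width of the competition set at each node or level
    seed=42,               # reproducibility for canary shuffles and other randomness
)
```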

How canaries affect standalone trees

For ordinary decision-tree training, canaries act as a local growth guard.

Effectively:

  • if an acceptable real feature survives the canary policy, the tree can continue
  • if the allowed competition window contains only canaries, that node should stop

This matters because standalone trees are the most exposed to “keep splitting because you can” behavior.

In that setting, canaries let ForestFire be less dependent on post-hoc pruning.

Categorical variables

Canaries now also play a second role for categorical handling.

For ordinary numeric and binary training, canaries act mainly as a split acceptance baseline:

  • is this candidate split better than structured noise?

For categorical encodings, there is an additional failure mode:

  • a target-informed encoding can manufacture convincing-looking structure from sparse or noisy categories even before the tree starts choosing splits

That matters most for:

  • target
  • fisher

and much less cleanly for:

  • dummy

Why categorical encodings need extra caution

target and fisher both use the training target while building the feature representation itself.

That means a noisy categorical column can still look strong:

  • target may give rare categories extreme encoded values
  • fisher may produce a clean-looking category ordering that is mostly luck

If the library only asked whether the resulting transformed feature can win a split later, it would be evaluating the encoding after some of the overfit risk has already been baked in.

Current behavior

ForestFire now uses canary-informed shrinkage for:

  • target
  • fisher

The basic idea is:

  1. measure how much target-structured separation the real categorical feature appears to have under the current smoothing level
  2. build a categorical canary by shuffling that feature while keeping the same category vocabulary and marginal counts
  3. measure the same target-structured separation on the shuffled feature
  4. increase smoothing when the canary remains too competitive with the real feature

So the canary is not just competing with the feature at split time. It is also used earlier as a robustness check on the encoding strength itself.
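A minimal sketch of that loop follows. This is not ForestFire’s implementation: the separation statistic, the competitiveness ratio, and the doubling schedule are all assumptions chosen to mirror steps 1 through 4:

```python
# Illustrative sketch of canary-informed shrinkage for a target-style encoding.
import numpy as np

rng = np.random.default_rng(0)

def separation(cats: np.ndarray, y: np.ndarray, smoothing: float) -> float:
    """Spread of smoothed per-category target means around the global prior."""
    prior = y.mean()
    spread = 0.0
    for c in np.unique(cats):
        y_c = y[cats == c]
        enc = (y_c.sum() + smoothing * prior) / (len(y_c) + smoothing)
        spread += len(y_c) * (enc - prior) ** 2
    return spread / len(y)

def canary_adjusted_smoothing(cats, y, base_smoothing, ratio=0.5, max_steps=8):
    """Raise smoothing while a shuffled copy stays too competitive."""
    smoothing = base_smoothing
    for _ in range(max_steps):
        real = separation(cats, y, smoothing)
        # Shuffling keeps the category vocabulary and marginal counts intact.
        canary = separation(rng.permutation(cats), y, smoothing)
        if canary < ratio * real:     # real signal clearly beats structured noise
            break
        smoothing *= 2.0              # pull harder toward the global prior
    return smoothing
```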

What gets adjusted

For target:

  • the effective smoothing can be increased on a per-feature basis

For fisher:

  • the same canary-informed smoothing adjustment is applied before computing the target-derived category ordering

That means weak or fragile category signal gets pulled harder toward the global prior before the downstream tree learner ever sees the transformed feature.

The base meaning of target_smoothing belongs to the categorical-feature documentation. The important canary-specific point is narrower:

  • canaries may increase the effective smoothing above the user-provided base level when shuffled-category signal looks too competitive

Why dummy is excluded

dummy is intentionally left out of this mechanism for now.

The reason is not that canaries are irrelevant there. The reason is that the comparison unit is awkward:

  • one source categorical feature expands into many derived indicator columns
  • a naive per-indicator canary comparison would not be statistically matched to the source feature as a whole

So dummy would need a grouped or source-feature-level canary treatment rather than the per-feature shrinkage rule used for target and fisher.

What this does and does not mean

This is a robustness mechanism, not a guarantee that leakage-like effects are fully solved.

It helps by making the encoding more conservative when shuffled categories can still produce suspiciously strong apparent signal. But it is still a heuristic shrinkage rule, not a full cross-fitting or ordered-statistics categorical scheme.

In practical terms, the current design means:

  • categorical canaries influence representation strength for target and fisher
  • ordinary canary competition still governs whether splits are allowed to win later in tree growth
  • dummy still uses the regular split-time canary mechanism only

The windowed canary policy

The important point is that ForestFire now separates two ideas that used to be fused together:

  • how candidates are scored and ranked
  • how strict the canary stopping rule is

Scoring still happens exactly the same way:

  • evaluate candidate splits
  • assign each candidate its usual objective score
  • sort candidates from best to worst

The filter option changes only the acceptance rule that runs after ranking.

Default behavior: filter=None or filter=1

The default policy is intentionally strict.

It means:

  • only the single top-ranked candidate is eligible
  • if that candidate is a real feature, training continues with it
  • if that candidate is a canary, the node or stage stops

This is the original “best split must beat noise directly” rule.

Integer windows: filter=n

If you pass an integer n, ForestFire allows a slightly softer rule:

  • rank all scored candidates as usual
  • inspect only the top n
  • choose the highest-ranked real feature inside that window

This means canaries are still allowed to rank ahead of the chosen real feature.

Example, with filter=3:

  • a canary may occupy rank 1
  • another canary may occupy rank 2
  • the best real feature at rank 3 is still allowed to split the node

If every candidate in the top 3 is a canary, the node still stops.

So filter=n is not “ignore canaries.”

It is:

  • “keep canaries as a guardrail”
  • “but allow the best real split to survive if it is still near the very top”

Fractional windows: filter=alpha

If you pass a float alpha in [0, 1), ForestFire interprets it as a top-window fraction of 1 - alpha.

Concretely:

  • let k be the number of scored candidates at the current decision point
  • compute ceil((1 - alpha) * k)
  • search only that top-ranked window for the first real feature

Example:

  • filter=0.95 means “accept the best real split inside the top 5% of ranked candidates”

That is useful when the number of scored candidates changes from node to node or from algorithm to algorithm and you want the canary rule to scale with the actual competition set rather than with a fixed absolute count.
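Putting the three forms together, a minimal sketch of the acceptance rule could look like this. The candidate representation and function names are assumptions; only the filter semantics come from this page:

```python
import math

def window_size(filter_, k: int) -> int:
    """How many top-ranked candidates are inspected, out of k scored ones."""
    if filter_ is None:
        return 1                                    # strict default: rank 1 only
    if isinstance(filter_, int):
        return min(filter_, k)                      # integer window: top n
    return max(1, math.ceil((1.0 - filter_) * k))   # float alpha: top (1 - alpha) fraction

def accept(ranked, filter_=None):
    """Return the best-ranked real candidate inside the window, or None to stop."""
    for candidate in ranked[: window_size(filter_, len(ranked))]:
        if not candidate.is_canary:
            return candidate
    return None    # window contains only canaries: growth is rejected here
```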

What the window is counting

The filter window is computed over scored split candidates, not over the original raw columns in isolation.

That distinction matters because the candidate pool depends on context:

  • max_features may restrict which features are even considered
  • some features may be unavailable or unsplittable at a given node
  • oblivious trees score one shared level-split rather than independent per-node splits
  • boosting applies the same logic only at the root of the next stage

So the meaning of “top 5%” is:

  • top 5% of the candidates that were actually scored at that decision point

not:

  • top 5% of all features in the original dataset regardless of feasibility

What happens when the window contains only canaries

The answer is still “stop.”

That is the core invariant ForestFire keeps:

  • canaries remain part of the acceptance test
  • the filter only controls how much room a near-top real feature is given to survive
  • if no real feature survives inside that allowed window, growth is rejected

This keeps the canary mechanism meaningful while allowing a less all-or-nothing policy than “rank 1 must already be real.”

How canaries affect different tree families

The effect is not identical across tree families because the growth structure differs.

ID3 and C4.5

These algorithms grow multiway classification trees.

Here, canaries matter as a branch-expansion gate:

  • if no real split survives inside the allowed top window, growth stops there

This is useful because multiway branching can create a large structural jump from a single split.

CART

For standard binary trees, the canary acts as the most direct local stop test.

This is the easiest setting to reason about:

  • if the allowed top-ranked window contains no real binary split, stop

Randomized trees

Randomized trees intentionally add stochasticity to split search.

That makes canaries especially useful, not less useful:

  • randomness can expose different candidate splits
  • canaries help decide whether those candidates are still meaningfully better than noise

The filter window is especially useful here because randomized search can move near-tied candidates around more than fully deterministic search does. A slightly wider window lets you keep the same “real split must still be near the top” principle without requiring the real split to win rank 1 every time.

Oblivious trees

Oblivious trees grow by level, not by arbitrary node.

That changes the interpretation:

  • a canary-only window does not just say “one node should stop”
  • it says “the next shared depth-level split is not justified”

So for oblivious trees, canaries naturally act as a depth-growth stop.

That is one reason they fit the oblivious setting well: the stopping signal is aligned with the level-wise growth pattern.
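A short level-wise sketch (illustrative only; function names are assumptions) makes that alignment explicit:

```python
# Illustrative sketch: in oblivious growth, a canary-only window rejects
# the next shared depth level for the whole tree, not just one node.
def grow_oblivious(max_levels, select_level_split):
    splits = []
    for depth in range(max_levels):
        split = select_level_split(depth)   # None when only canaries survive the window
        if split is None:
            break                           # depth-growth stop for the entire level
        splits.append(split)
    return splits
```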

How canaries interact with random forests

Random forests are the one place where ForestFire intentionally disables canaries during tree training.

That is not an omission. It is a design choice.

Why:

  • forests already regularize heavily through bootstrapping
  • they also regularize through feature subsampling
  • each tree is intentionally weak, noisy, and diverse

If canaries were left active inside every forest tree, they would often stop growth for the wrong reason:

  • not because the data has no usable structure
  • but because each tree is already working with a restricted, noisy view of the problem

That would reduce useful ensemble diversity.

So in ForestFire:

  • standalone trees use canaries
  • random forests ignore them

This is why you should not interpret the absence of canaries in RF as inconsistent. It reflects a different regularization regime.

What replaces canaries in random forests

In random forests, the main anti-overfitting forces are:

  • bootstrap sampling
  • max_features
  • averaging across many trees
  • ordinary structural limits like depth and minimum samples

Those are stronger and more appropriate regularizers for bagged ensembles than canary competition inside each constituent tree.

How canaries interact with gradient boosting

Gradient boosting is the opposite case.

ForestFire keeps canaries active there on purpose.

Why:

  • boosting fits trees stage by stage to what the current ensemble still gets wrong
  • later stages are especially willing to chase tiny residual structure
  • that is exactly where “signal vs structured noise” becomes most important

So in boosting, canaries are not disabled. They become an early-stop signal for stage addition.

The boosting-specific rule

In ForestFire’s gradient boosting implementation, training stops when:

  1. n_trees stages have been trained, or
  2. the allowed root-level filter window contains no real split and only canaries survive there

There is one configurable escape hatch for the very first stage:

  • boosting_first_stage_retry_filter can retry stage 0 with a wider canary window
  • the default is still TopN(1), so nothing is relaxed unless you opt into a wider retry window
  • in Python, setting it to False disables that retry path entirely
  • in Rust, setting it to None disables that retry path entirely

That second rule is important.

It means:

  • the stage is not merely weakened
  • it is discarded entirely
  • boosting stops at that point

This is a strong statement about the residual problem:

  • if even the allowed top-ranked root candidates do not contain a real split that survives canary competition, continuing to add trees is not justified
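As a sketch of that control flow (function names are assumptions; the two stop rules and the stage-0 retry mirror the description above):

```python
# Illustrative sketch of stage-level stopping in boosting. Passing None for
# the retry filter disables the retry path, which is behaviorally equivalent
# to the documented TopN(1) default (a retry with the same strict window).
def boost(n_trees, fit_stage, root_split_survives, first_stage_retry_filter=None):
    stages = []
    for stage_idx in range(n_trees):                    # rule 1: stage budget
        ok = root_split_survives(stage_idx, window=1)   # strict rank-1 root test
        if not ok and stage_idx == 0 and first_stage_retry_filter is not None:
            # escape hatch: retry stage 0 with a wider canary window
            ok = root_split_survives(stage_idx, window=first_stage_retry_filter)
        if not ok:
            break                                       # rule 2: stage discarded, boosting stops
        stages.append(fit_stage(stage_idx))
    return stages
```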

Why the root matters in boosting

The root split is the cleanest stage-level test.

If the first split of a new tree cannot beat a canary, then the residual signal the stage is trying to capture is already below the noise baseline at the coarsest possible level.

That makes it a natural stopping rule for additive boosting.

It is also much cheaper and easier to reason about than trying to define a later, more entangled “the whole stage was too weak” criterion.

How this differs from classic early stopping

Many boosting systems rely on:

  • validation loss tracking
  • patience
  • explicit early stopping rounds

ForestFire’s canary logic is different.

It is:

  • training-internal
  • structure-aware
  • based on noise competition, not just held-out loss motion

That does not make validation-based early stopping useless. It means the library’s first line of defense is embedded directly in how the next stage is justified.

Practical implications by algorithm

Single trees

Canaries make the tree less likely to keep growing just because local impurity still improves slightly.

Random forests

Canaries are disabled because bagging and feature subsampling are already the intended anti-overfitting mechanism.

Gradient boosting

Canaries remain active and become a stage-level stop signal because boosting is especially prone to late-stage noise fitting.

How to think about hyperparameters when canaries exist

A useful mental model is:

  • canaries answer “is there still signal here?”
  • structural hyperparameters answer “how large, stable, and expensive am I willing to let the model become?”

That division is why both exist.

You should not treat canaries as a total replacement for ordinary controls. You should treat them as a better answer to one specific problem:

  • distinguishing real split signal from noise-like split opportunity

Why this is a design choice, not just a trick

The canary strategy reflects ForestFire’s broader design philosophy:

  • make semantics explicit
  • push regularization into training-time decision points
  • prefer noise-aware growth rules over “grow first, clean up later”

That is why canaries are not an optional afterthought in the codebase.

They are part of how the library defines justified tree growth.