Categorical Strategies

ForestFire currently supports three practical categorical strategies through the normal native train(...) interface:

  • dummy
  • target
  • fisher

These strategies are now implemented in Rust, not as Python-side preprocessing. Python sends mixed rows directly into the native training path, and Rust applies the configured categorical transform before handing the resulting representation to the tree/forest/GBM trainers.

The implementation is intentionally split by responsibility:

  • dummy and target live close to the table layer because they are fundamentally table-to-table feature transforms
  • fisher lives in core because it is not just a generic encoding; it is a target-informed category-ordering strategy tied to tree split semantics

That said, the current system is still based on transformed numeric/binary features rather than fully native categorical predicates in the IR and runtime.

This page explains what each strategy means, when it is appropriate, and what its tradeoffs are.

Why categorical handling is a separate topic

Tree libraries often make a sharp distinction between:

  • numeric features
  • binary features
  • categorical features

That distinction matters because categories are labels, not magnitudes.

If a column contains values such as:

  • red
  • blue
  • green

then the learner should not assume:

  • green > blue
  • blue is “between” red and green
  • adjacent integer ids imply semantic similarity

The central problem is therefore not just storage. It is split semantics.

Different strategies solve that problem in different ways:

  • dummy expands one categorical source column into several binary indicators
  • target replaces each category with a target-derived statistic
  • fisher orders categories by target-derived statistics and then treats that order as a numeric feature for threshold splits

None of those is universally correct. They are practical approximations with different bias and variance behavior.

Strategy Selection

The rough rule of thumb is:

  • use dummy for low-cardinality nominal features
  • use target for higher-cardinality features when width growth would become expensive
  • use fisher when you want a LightGBM-like ordered-category split surrogate for binary-threshold tree learners

If you are unsure:

  • start with dummy for small vocabularies
  • consider target or fisher once category counts become large enough that width expansion is no longer attractive

dummy

dummy is the public name for the classic one-of-K compatibility encoding.

For a source column like:

  • red
  • blue
  • green

the encoded representation becomes a set of binary features:

  • is_red
  • is_blue
  • is_green
  • plus an extra unknown/missing indicator in the current implementation

So a row with category blue becomes something like:

  • 0, 1, 0, 0

and an unseen or missing category becomes:

  • 0, 0, 0, 1
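
A minimal Python sketch of that expansion, purely for intuition (the real transform runs in Rust; the helper below is hypothetical):

def dummy_encode(value, vocabulary):
    # One indicator per known category, plus a trailing unknown/missing slot.
    row = [0] * (len(vocabulary) + 1)
    if value in vocabulary:
        row[vocabulary.index(value)] = 1
    else:
        row[-1] = 1  # unseen or missing categories activate the extra slot
    return row

vocab = ["red", "blue", "green"]
dummy_encode("blue", vocab)    # [0, 1, 0, 0]
dummy_encode("purple", vocab)  # [0, 0, 0, 1]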

Why the name dummy

The term “dummy variable” is standard statistical language, and it avoids mixing the public API name with the internal implementation detail that the expansion is effectively one-hot-like.

Why dummy is useful

This strategy is the lowest-risk way to add categorical support to existing binary split learners because it reuses machinery that already works well:

  • binary feature handling
  • binary histogram construction
  • ordinary threshold or boolean split logic

That makes it a strong baseline when:

  • categories are nominal
  • cardinality is low
  • interpretability of individual category indicators is useful

Strengths

  • semantically safe for nominal categories
  • easy to reason about
  • low implementation risk
  • naturally compatible with existing binary-feature tree code
  • explicit unknown-category fallback can be represented as its own indicator

Weaknesses

  • feature width grows with category count
  • max_features behavior can get distorted because one source feature becomes many derived columns
  • grouped category rules may require multiple tree levels

That last point matters. A native categorical split could represent:

  • red or green -> left
  • blue -> right

in one node.

A dummy encoding often needs multiple binary tests to recover the same logic.

Best use cases

  • small vocabularies
  • mostly nominal categories
  • compatibility-oriented first pass
  • cases where model width growth is acceptable

target

target encoding replaces each category with a statistic derived from the training target.

Conceptually:

  • for regression, categories are mapped to smoothed target means
  • for classification, categories are mapped to smoothed class probabilities

So instead of turning a feature with 500 categories into 500 indicator columns, the strategy turns it into one or a few numeric columns.

Why this helps

This is attractive when cardinality is high because the representation size no longer grows linearly with the number of categories.

That means:

  • narrower feature matrices
  • less pressure on feature subsampling
  • simpler downstream split search

Smoothing

Naive target encoding is dangerous because rare categories can overfit badly.

If a category appears once and that row’s target is positive, a naive encoding would treat that category as maximally positive. That is usually too confident.

ForestFire’s current target strategy therefore uses smoothing toward a global prior.

Conceptually the encoded value is:

  • a blend of the category-local estimate
  • and the overall target prior

The target_smoothing parameter controls how strongly rare categories are pulled back toward the global prior.

Larger smoothing means:

  • more shrinkage toward the global average
  • less variance on rare categories
  • more bias for legitimately strong but infrequent levels

Smaller smoothing means:

  • more category-specific flexibility
  • more sensitivity to rare-level noise
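
As a rough sketch, assuming the common count-weighted blend (the exact native formula may differ):

def smoothed_target_mean(cat_sum, cat_count, global_mean, smoothing):
    # Blend the category-local mean with the global prior; larger
    # smoothing pulls rare categories harder toward the prior.
    return (cat_sum + smoothing * global_mean) / (cat_count + smoothing)

# A category seen once, with target 1.0, global mean 0.2, smoothing 10:
smoothed_target_mean(1.0, 1, 0.2, 10.0)  # ~0.27 rather than a naive 1.0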

This parameter is part of categorical modeling itself, not part of the canary policy.

In other words:

  • target_smoothing defines the base amount of shrinkage used by categorical training
  • canaries may later increase the effective smoothing for target and fisher when shuffled-category signal is too competitive

So if you are trying to understand what the parameter means, this is the primary section to read. The canaries page only explains when ForestFire may adapt that base value upward for robustness.
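
For example, assuming target_smoothing is passed directly alongside the other train(...) options (check your version's exact signature), a call might look like:

train(
    X,
    y,
    task="regression",
    categorical_strategy="target",
    target_smoothing=20.0,  # illustrative value, not a recommended default
)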

Classification behavior

For classification, the encoder currently emits one value per class. That means a categorical feature can expand to:

  • a single encoded column in binary classification, if the downstream path later collapses the per-class values
  • or several numeric columns representing class-probability-like statistics

In the current implementation, these encoded values are then treated as numeric features by the tree learner.
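
Conceptually, for a three-class problem one categorical column becomes up to three numeric columns, one smoothed probability-like value per class. A hypothetical sketch:

def smoothed_class_probs(class_counts, total, global_probs, smoothing):
    # One shrunken per-class statistic for a single category.
    return [
        (class_counts[k] + smoothing * global_probs[k]) / (total + smoothing)
        for k in range(len(class_counts))
    ]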

Strengths

  • scales much better than dummy on high-cardinality features
  • can capture useful target associations compactly
  • unknown categories can fall back cleanly to the global prior

Weaknesses

  • target leakage is a real conceptual risk in general target encoding
  • encoded values depend on the training target, not just feature structure
  • category meaning becomes less directly interpretable

The current implementation is a practical smoothed training-time encoding, not a full cross-fitting or CatBoost-style ordered-statistics system.

That means it is useful now, but it should be understood as an initial pragmatic implementation rather than the final leakage-minimizing design.

Best use cases

  • medium or high-cardinality categorical columns
  • regression problems
  • classification problems where width growth would become too large under dummy
  • cases where compactness matters more than strict interpretability of raw category rules

fisher

fisher is an ordered-category strategy inspired by the category ordering idea used by systems like LightGBM for histogram-based categorical split search.

The basic idea is:

  1. compute target-derived statistics per category
  2. order categories by those statistics
  3. assign each category a numeric rank
  4. let the tree learner split over that ordered axis with ordinary threshold logic

So categories are not expanded into many columns. They become a single ordered numeric surrogate feature.
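
A sketch of that ordering step (illustrative only; the native path works on histograms and reuses the smoothed statistics described under target):

from collections import defaultdict

def fisher_ranks(categories, targets, global_mean, smoothing):
    # Smoothed per-category target mean, then rank categories by that mean.
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    score = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    ordered = sorted(score, key=score.get)
    return {c: rank for rank, c in enumerate(ordered)}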

Why this is useful

This gives binary split learners a way to separate categories using one threshold over a learned ordering rather than many indicator columns.

That can be much more compact than dummy while still behaving more like a categorical split than arbitrary integer ids would.

The important distinction is that the order is not user-supplied or arbitrary. It is derived from target statistics.

That target-derived ordering is what separates this strategy from naive ordinal encoding.

Why “Fisher”

The name here is meant to describe category separation by target-informed ordering and threshold scan, not to claim a full clone of any single external library’s exact categorical algorithm.

The current implementation uses target-derived category ordering and then maps that order to ranks consumed by the existing numeric split machinery.

Although fisher is not itself target encoding, it still depends on target-derived category statistics. For that reason, the same target_smoothing parameter is also used as the base regularization level for the per-category statistics that determine the ordering.

Strengths

  • compact representation
  • often more expressive than dummy for medium-cardinality categoricals
  • closer in spirit to native categorical threshold ordering than arbitrary integer labels
  • fits naturally into binary threshold tree learners

Weaknesses

  • still not a fully native categorical subset split
  • quality depends on how informative the target-derived ordering is
  • can be unstable when categories are extremely rare
  • like target, it depends on target-derived statistics

Best use cases

  • medium-cardinality categorical features
  • boosted trees and other binary split learners where grouped category separation matters
  • cases where dummy would be too wide but full native categorical support is not available yet

Unknown And Missing Categories

Current behavior depends on the strategy:

  • dummy: unseen or missing categories activate the dedicated unknown/missing indicator
  • target: unseen or missing categories fall back to the global prior
  • fisher: unseen or missing categories fall back to NaN, which then follows the library’s existing missing-value handling

That means unknown-category behavior is explicit rather than silently collapsing into one observed category.
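
A sketch of those documented fallbacks (hypothetical helper, illustrative values):

def encode_unseen(strategy, vocab_size, global_prior):
    if strategy == "dummy":
        return [0] * vocab_size + [1]   # only the unknown/missing indicator fires
    if strategy == "target":
        return global_prior             # shrink fully to the global prior
    if strategy == "fisher":
        return float("nan")             # defer to existing missing-value handling
    raise ValueError(f"unknown strategy: {strategy}")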

Interaction With ForestFire Training

After categorical handling runs, the downstream trainers still see ordinary numeric or binary feature matrices.

So all existing training behavior still applies:

  • trees, forests, and GBM all use the same transformed inputs
  • canaries still compete against transformed features
  • histogram-based split search still works the same way after encoding
  • missing-value handling still follows the numeric/binary semantics of the transformed representation

This is both the strength and the limitation of the current design:

  • it makes categorical strategies available through the same native training path as the rest of the library
  • but the model is still learning on transformed features, not on native categorical predicates

At the API level, this means there is no separate categorical training entry point. You still call train(...) and pass categorical options directly:

train(
    X,
    y,
    task="classification",
    tree_type="cart",
    categorical_strategy="dummy",
)

The same applies to the sklearn-style wrappers, which forward the categorical configuration through the same native path.
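
For the wrapper route, assuming an estimator-style class (the class name below is hypothetical; use whatever wrapper names your installed version exposes), the shape would be something like:

clf = ForestFireClassifier(          # hypothetical wrapper name
    tree_type="cart",
    categorical_strategy="dummy",
)
clf.fit(X, y)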

Serialization And IR

Categorical models now serialize their transform contract explicitly.

That includes:

  • semantic JSON serialization
  • IR export
  • compiled optimized serialization

The important design point is that the learned trees still live in encoded feature space, while the serialized categorical metadata preserves the raw input contract and the transform needed to reach that encoded space.

So a serialized categorical model records both:

  • the raw categorical/numeric inputs it expects
  • the categorical transform required before tree evaluation

This is enough to round-trip categorical models without asking callers to rebuild the transform manually.
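
Purely as an illustration of what that contract covers (field names invented here, not the actual schema):

categorical_contract = {
    # raw inputs the model expects before any encoding
    "raw_columns": [{"name": "color", "kind": "categorical"}],
    # transform applied to reach the encoded space the trees were trained in
    "transform": {
        "strategy": "target",
        "target_smoothing": 20.0,
        "encodings": {"red": 0.31, "blue": 0.18, "green": 0.44},
        "unknown_fallback": "global_prior",
    },
}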

The remaining limitation is semantic rather than transport-related:

  • categorical strategies are now preserved in IR and serialization
  • but the learned trees still operate on transformed numeric/binary features, not on native categorical subset predicates

Practical Guidance

If you want a simple starting point:

  • use dummy when category counts are small

If width becomes a problem:

  • try target

If you want a compact LightGBM-like category ordering surrogate for binary split learners:

  • try fisher

There is no universal winner. The right choice depends on:

  • cardinality
  • rare-category frequency
  • whether interpretability matters
  • whether width or leakage risk is the bigger practical concern