Categorical Strategies¶
ForestFire currently supports three practical categorical strategies through
the normal native train(...) interface:
- `dummy`
- `target`
- `fisher`
These strategies are now implemented in Rust, not as Python-side preprocessing. Python sends mixed rows directly into the native training path, and Rust applies the configured categorical transform before handing the resulting representation to the tree/forest/GBM trainers.
The implementation is intentionally split by responsibility:
- `dummy` and `target` live close to the table layer because they are fundamentally table-to-table feature transforms
- `fisher` lives in core because it is not just a generic encoding; it is a target-informed category-ordering strategy tied to tree split semantics
That said, the current system is still based on transformed numeric/binary features rather than fully native categorical predicates in the IR and runtime.
This page explains what each strategy means, when it is appropriate, and what its tradeoffs are.
Why categorical handling is a separate topic¶
Tree libraries often make a sharp distinction between:
- numeric features
- binary features
- categorical features
That distinction matters because categories are labels, not magnitudes.
If a column contains values such as:
- `red`
- `blue`
- `green`
then the learner should not assume:
- `green > blue`
- `blue` is “between” `red` and `green`
- adjacent integer ids imply semantic similarity
The central problem is therefore not just storage. It is split semantics.
Different strategies solve that problem in different ways:
- `dummy` expands one categorical source column into several binary indicators
- `target` replaces each category with a target-derived statistic
- `fisher` orders categories by target-derived statistics and then treats that order as a numeric feature for threshold splits
None of those is universally correct. They are practical approximations with different bias and variance behavior.
Strategy Selection¶
The rough rule of thumb is:
- use `dummy` for low-cardinality nominal features
- use `target` for higher-cardinality features when width growth would become expensive
- use `fisher` when you want a LightGBM-like ordered-category split surrogate for binary-threshold tree learners
If you are unsure:
- start with `dummy` for small vocabularies
- consider `target` or `fisher` once category counts become large enough that width expansion is no longer attractive
dummy¶
`dummy` is the renamed public term for the classic one-of-K compatibility
encoding.
For a source column like:
- `red`
- `blue`
- `green`
the encoded representation becomes a set of binary features:
- `is_red`
- `is_blue`
- `is_green`
- plus an extra unknown/missing indicator in the current implementation
So a row with category `blue` becomes something like:
`0, 1, 0, 0`
and an unseen or missing category becomes:
`0, 0, 0, 1`
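The expansion can be sketched in plain Python. This is an illustration of the idea only: the actual transform runs natively in Rust, and the vocabulary and helper name here are hypothetical.

```python
# Minimal sketch of dummy (one-of-K) expansion with a trailing
# unknown/missing slot. Illustrative only -- not ForestFire's
# native Rust implementation.
def dummy_encode(value, vocabulary):
    """Return one binary indicator per known category, plus one
    trailing unknown/missing indicator."""
    row = [0] * (len(vocabulary) + 1)
    if value in vocabulary:
        row[vocabulary.index(value)] = 1
    else:
        row[-1] = 1  # unseen or missing category
    return row

vocab = ["red", "blue", "green"]
print(dummy_encode("blue", vocab))    # is_red, is_blue, is_green, unknown
print(dummy_encode("purple", vocab))  # falls into the unknown slot
```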
Why the name dummy¶
The term “dummy variable” is standard statistical language, and it avoids mixing the public API name with the internal implementation detail that the expansion is effectively one-hot-like.
Why dummy is useful¶
This strategy is the lowest-risk way to add categorical support to existing binary split learners because it reuses machinery that already works well:
- binary feature handling
- binary histogram construction
- ordinary threshold or boolean split logic
That makes it a strong baseline when:
- categories are nominal
- cardinality is low
- interpretability of individual category indicators is useful
Strengths¶
- semantically safe for nominal categories
- easy to reason about
- low implementation risk
- naturally compatible with existing binary-feature tree code
- explicit unknown-category fallback can be represented as its own indicator
Weaknesses¶
- feature width grows with category count
- `max_features` behavior can get distorted because one source feature becomes many derived columns
- grouped category rules may require multiple tree levels
That last point matters. A native categorical split could represent:
`red or green -> left`
`blue -> right`
in one node.
A dummy encoding often needs multiple binary tests to recover the same logic.
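The difference can be made concrete with a tiny routing sketch over hypothetical dummy indicators:

```python
# Sketch: recovering "red or green -> left, blue -> right" from dummy
# indicators takes two chained binary tests instead of one categorical node.
def route(is_red, is_green):
    if is_red == 1:        # first node tests is_red
        return "left"
    if is_green == 1:      # second node tests is_green
        return "left"
    return "right"         # remaining categories, e.g. blue

print(route(is_red=1, is_green=0))  # red  -> "left"
print(route(is_red=0, is_green=0))  # blue -> "right"
```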
Best use cases¶
- small vocabularies
- mostly nominal categories
- compatibility-oriented first pass
- cases where model width growth is acceptable
target¶
target encoding replaces each category with a statistic derived from the
training target.
Conceptually:
- for regression, categories are mapped to smoothed target means
- for classification, categories are mapped to smoothed class probabilities
So instead of turning a feature with 500 categories into 500 indicator columns, the strategy turns it into one or a few numeric columns.
Why this helps¶
This is attractive when cardinality is high because the representation size no longer grows linearly with the number of categories.
That means:
- narrower feature matrices
- less pressure on feature subsampling
- simpler downstream split search
Smoothing¶
Naive target encoding is dangerous because rare categories can overfit badly.
If a category appears once and that row’s target is positive, a naive encoding would treat that category as maximally positive. That is usually too confident.
ForestFire’s current target strategy therefore uses smoothing toward a global
prior.
Conceptually the encoded value is:
- a blend of the category-local estimate
- and the overall target prior
The `target_smoothing` parameter controls how strongly rare categories are
pulled back toward the global prior.
Larger smoothing means:
- more shrinkage toward the global average
- less variance on rare categories
- more bias for legitimately strong but infrequent levels
Smaller smoothing means:
- more category-specific flexibility
- more sensitivity to rare-level noise
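One common way to realize this blend is additive smoothing toward the prior. The sketch below shows the idea in plain Python; it should be read as an illustration of the mechanism, not as ForestFire's exact internal formula.

```python
# Sketch of smoothed target encoding for regression:
#   encoded(c) = (sum_c + s * prior) / (n_c + s)
# where s plays the role of a target_smoothing-style knob.
from collections import defaultdict

def smoothed_target_encode(categories, targets, smoothing):
    prior = sum(targets) / len(targets)          # global target mean
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {
        c: (sums[c] + smoothing * prior) / (counts[c] + smoothing)
        for c in counts
    }

cats = ["a", "a", "a", "b"]   # "b" appears only once
ys = [1.0, 1.0, 0.0, 1.0]
enc = smoothed_target_encode(cats, ys, smoothing=2.0)
# the rare "b" level is pulled from 1.0 toward the prior (0.75)
```

With `smoothing=0.0` the rare level keeps its raw, overconfident estimate of `1.0`; larger values shrink it toward the global mean.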
This parameter is part of categorical modeling itself, not part of the canary policy.
In other words:
- `target_smoothing` defines the base amount of shrinkage used by categorical training
- canaries may later increase the effective smoothing for `target` and `fisher` when shuffled-category signal is too competitive
So if you are trying to understand what the parameter means, this is the primary section to read. The canaries page only explains when ForestFire may adapt that base value upward for robustness.
Classification behavior¶
For classification, the encoder currently emits one value per class. That means a categorical feature can expand to:
- a single encoded column in binary classification, if the downstream path collapses it later
- or several numeric columns representing class-probability-like statistics
In the current implementation, these encoded values are then treated as numeric features by the tree learner.
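The per-class expansion can be sketched as follows. This is illustrative Python assuming the same additive-smoothing idea described above; the function name and exact formula are not the library's.

```python
# Sketch: for classification, each category maps to one smoothed
# class-probability-like value per class. Illustrative only.
from collections import defaultdict

def class_prob_encode(categories, labels, classes, smoothing=1.0):
    n = len(labels)
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    class_totals = defaultdict(int)
    for c, y in zip(categories, labels):
        counts[c][y] += 1
        totals[c] += 1
        class_totals[y] += 1
    priors = {k: class_totals[k] / n for k in classes}
    # each category expands to len(classes) numeric columns
    return {
        c: [(counts[c][k] + smoothing * priors[k]) / (totals[c] + smoothing)
            for k in classes]
        for c in totals
    }

enc = class_prob_encode(["a", "a", "b"], [0, 1, 1], classes=[0, 1])
# enc["a"] holds two smoothed class-probability-like values summing to 1
```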
Strengths¶
- scales much better than `dummy` on high-cardinality features
- can capture useful target associations compactly
- unknown categories can fall back cleanly to the global prior
Weaknesses¶
- target leakage is a real conceptual risk in general target encoding
- encoded values depend on the training target, not just feature structure
- category meaning becomes less directly interpretable
The current implementation is a practical smoothed training-time encoding, not a full cross-fitting or CatBoost-style ordered-statistics system.
That means it is useful now, but it should be understood as an initial pragmatic implementation rather than the final leakage-minimizing design.
Best use cases¶
- medium or high-cardinality categorical columns
- regression problems
- classification problems where width growth would become too large under `dummy`
- cases where compactness matters more than strict interpretability of raw category rules
fisher¶
fisher is an ordered-category strategy inspired by the category ordering idea
used by systems like LightGBM for histogram-based categorical split search.
The basic idea is:
- compute target-derived statistics per category
- order categories by those statistics
- assign each category a numeric rank
- let the tree learner split over that ordered axis with ordinary threshold logic
So categories are not expanded into many columns. They become a single ordered numeric surrogate feature.
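The four steps above can be sketched in plain Python. This is an illustration of the ordering idea only; the native implementation and the exact statistics it uses may differ.

```python
# Sketch of target-derived category ordering (the "fisher" idea):
# order categories by a smoothed target statistic, then use the rank
# as a single numeric surrogate feature for ordinary threshold splits.
from collections import defaultdict

def fisher_ranks(categories, targets, smoothing=1.0):
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    # smoothed per-category statistic (same regularization idea as target)
    stat = {c: (sums[c] + smoothing * prior) / (counts[c] + smoothing)
            for c in counts}
    ordered = sorted(stat, key=stat.get)
    return {c: rank for rank, c in enumerate(ordered)}

cats = ["red", "red", "blue", "blue", "green", "green"]
ys = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0]
ranks = fisher_ranks(cats, ys)
# red (low target mean) gets the lowest rank and blue the highest,
# so a single threshold over the rank axis can separate red from blue
```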
Why this is useful¶
This gives binary split learners a way to separate categories using one threshold over a learned ordering rather than many indicator columns.
That can be much more compact than dummy while still behaving more like a
categorical split than arbitrary integer ids would.
The important distinction is that the order is not user-supplied or arbitrary. It is derived from target statistics.
That is what makes this strategy categorically motivated rather than just naive ordinal encoding.
Why “Fisher”¶
The name here is meant to describe category separation by target-informed ordering and threshold scan, not to claim a full clone of any single external library’s exact categorical algorithm.
The current implementation uses target-derived category ordering and then maps that order to ranks consumed by the existing numeric split machinery.
Although fisher is not itself target encoding, it still depends on
target-derived category statistics. For that reason, the same
target_smoothing parameter is also used as the base regularization level for
the per-category statistics that determine the ordering.
Strengths¶
- compact representation
- often more expressive than `dummy` for medium-cardinality categoricals
- closer in spirit to native categorical threshold ordering than arbitrary integer labels
- fits naturally into binary threshold tree learners
Weaknesses¶
- still not a fully native categorical subset split
- quality depends on how informative the target-derived ordering is
- can be unstable when categories are extremely rare
- like `target`, it depends on target-derived statistics
Best use cases¶
- medium-cardinality categorical features
- boosted trees and other binary split learners where grouped category separation matters
- cases where `dummy` would be too wide but full native categorical support is not available yet
Unknown And Missing Categories¶
Current behavior depends on the strategy:
- `dummy`: unseen or missing categories activate the dedicated unknown/missing indicator
- `target`: unseen or missing categories fall back to the global prior
- `fisher`: unseen or missing categories fall back to `NaN`, which then follows the library’s existing missing-value handling
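The `target` and `fisher` fallbacks amount to simple lookups with a default, as in the self-contained sketch below (illustrative only; the real handling lives in the native Rust transform):

```python
# Sketch of unknown-category fallbacks for the target and fisher paths.
import math

def target_lookup(category, encoding, prior):
    # unseen or missing categories fall back to the global prior
    return encoding.get(category, prior)

def fisher_lookup(category, ranks):
    # unseen or missing categories become NaN, which is handed to the
    # library's existing missing-value handling downstream
    return ranks.get(category, math.nan)

enc = {"red": 0.2, "blue": 0.8}
print(target_lookup("purple", enc, prior=0.5))                     # -> 0.5
print(math.isnan(fisher_lookup("purple", {"red": 0, "blue": 1})))  # True
```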
That means unknown-category behavior is explicit rather than silently collapsing into one observed category.
Interaction With ForestFire Training¶
After categorical handling runs, the downstream trainers still see ordinary numeric or binary feature matrices.
So all existing training behavior still applies:
- trees, forests, and GBM all use the same transformed inputs
- canaries still compete against transformed features
- histogram-based split search still works the same way after encoding
- missing-value handling still follows the numeric/binary semantics of the transformed representation
This is both the strength and the limitation of the current design:
- it makes categorical strategies available through the same native training path as the rest of the library
- but the model is still learning on transformed features, not on native categorical predicates
At the API level, this means there is no separate categorical training entry
point. You still call train(...) and pass categorical options directly:
```python
train(
    X,
    y,
    task="classification",
    tree_type="cart",
    categorical_strategy="dummy",
)
```
The same applies to the sklearn-style wrappers, which forward the categorical configuration through the same native path.
Serialization And IR¶
Categorical models now serialize their transform contract explicitly.
That includes:
- semantic JSON serialization
- IR export
- compiled optimized serialization
The important design point is that the learned trees still live in encoded feature space, while the serialized categorical metadata preserves the raw input contract and the transform needed to reach that encoded space.
So a serialized categorical model records both:
- the raw categorical/numeric inputs it expects
- the categorical transform required before tree evaluation
This is enough to round-trip categorical models without asking callers to rebuild the transform manually.
The remaining limitation is semantic rather than transport-related:
- categorical strategies are now preserved in IR and serialization
- but the learned trees still operate on transformed numeric/binary features, not on native categorical subset predicates
Practical Guidance¶
If you want a simple starting point:
- use `dummy` when category counts are small
If width becomes a problem:
- try `target`
If you want a compact LightGBM-like category ordering surrogate for binary split learners:
- try `fisher`
There is no universal winner. The right choice depends on:
- cardinality
- rare-category frequency
- whether interpretability matters
- whether width or leakage risk is the bigger practical concern