Intermediate Representation¶
ForestFire has an explicit model IR instead of treating serialization as a direct dump of internal Rust structs.
That decision is central to the project.
The IR is the stable semantic layer between:
- training-time model structures
- optimized inference runtimes
- Python and Rust bindings
- serialized artifacts on disk
The important word there is semantic.
The IR is not trying to be the fastest possible execution format. It is trying to be the clearest possible statement of:
- what inputs the model expects
- what each split means
- what each leaf payload means
- how outputs should be interpreted
- which preprocessing assumptions must be preserved for inference to remain correct
Why the IR exists¶
Many libraries effectively serialize whatever their internal training structs happen to look like.
That is easy at first, but it creates problems quickly:
- trainer internals leak into the exported format
- optimized runtimes become hard to reconstruct safely
- introspection and export drift apart
- changing internal layouts becomes risky
ForestFire avoids that by making the IR explicit.
The IR describes what a model means, not how a particular Rust struct happens to store it in memory.
That is why:
- Model / OptimizedModel JSON serialization
- tree introspection
- dataframe export
can all agree with one another even though they do not store or consume the model in the same way internally.
What the IR contains¶
The top-level package is ModelPackageIr.
It includes:
- producer metadata
- model structure
- input schema
- output schema
- inference options
- preprocessing description
- postprocessing description
- training metadata
- integrity / compatibility metadata
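As a rough sketch, the package can be pictured as one struct with one field per concern. The field names below are illustrative stand-ins, not the actual definitions in crates/core:

```rust
// Hypothetical shape of the top-level IR package; field names are
// illustrative. Sub-structures are collapsed to serde_json::Value
// to keep the sketch self-contained.
use serde::{Deserialize, Serialize};
use serde_json::Value;

#[derive(Serialize, Deserialize)]
struct ModelPackageIr {
    producer: Value,          // producer metadata (library, version)
    model: Value,             // tree structure: node_tree / oblivious_levels
    input_schema: Value,      // feature count, ordering, logical types
    output_schema: Value,     // raw vs final outputs, class ordering
    inference_options: Value, // semantic inference settings
    preprocessing: Value,     // numeric bin boundaries, binary handling
    postprocessing: Value,    // steps from raw outputs to final outputs
    training: Value,          // algorithm, seed, hyperparameters
    integrity: Value,         // integrity / compatibility metadata
}
```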
In practical terms, the IR answers questions like:
- what algorithm family produced this model?
- how many trees are there?
- how are the features binned?
- what do leaf values mean?
- how should outputs be interpreted?
- what assumptions does an inference runtime need to preserve?
It also answers a subtler but equally important question:
- what parts of the system are allowed to change without changing model meaning?
For ForestFire, that includes:
- optimized runtime layout
- feature projection used by optimized inference
- ensemble runtime ordering
- compiled artifact structure
None of those are part of the canonical IR because none of them change the semantic function the model computes.
Why this matters for users¶
The IR is not just an implementation detail.
It is what makes these features line up with one another:
- to_ir() / JSON serialization
- optimize_inference()
- tree_structure(...)
- tree_node(...), tree_level(...), tree_leaf(...)
- to_dataframe()
Without a stable semantic representation, those would each need their own partially overlapping view of the model.
With the IR, they all derive from the same source of truth.
This is also why optimized and non-optimized models export the same IR. If they exported different semantic artifacts, optimization would become a semantic transformation instead of an execution transformation, which is exactly what the design is trying to avoid.
Tree representations in the IR¶
ForestFire exports trees in one of two structural forms:
- node_tree
- oblivious_levels
node_tree¶
This is used for standard trees such as:
- id3
- c45
- cart
- randomized
The IR stores:
- explicit nodes
- node ids
- split metadata
- child references
- leaf payloads
- node statistics
This representation is natural for trees whose structure is irregular and branch-specific.
That includes:
- ordinary CART-like binary trees
- randomized trees, which are structurally like CART but differ in how candidate splits are chosen
- id3 and c45, whose learned structure may include multiway branches
For binary node trees, the IR can now also represent oblique splits in addition to ordinary axis-aligned threshold splits.
Those oblique splits are currently sparse and pairwise:
w1 * x_i + w2 * x_j <= t
So the IR stores not only the participating feature indices and weights, but also the learned missing-direction metadata for those two features.
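A minimal sketch of evaluating such a split, with illustrative names (ObliqueSplit, goes_left) rather than the actual runtime types:

```rust
// Hypothetical sparse pairwise oblique split: w1 * x_i + w2 * x_j <= t.
struct ObliqueSplit {
    features: [usize; 2],    // participating feature indices i, j
    weights: [f64; 2],       // w1, w2
    threshold: f64,          // t
    missing_left: [bool; 2], // learned missing direction per feature
}

impl ObliqueSplit {
    /// Returns true if the row routes left. Missing routing is replayed
    /// per participating feature before the projection is evaluated;
    /// this sketch lets the first missing feature decide.
    fn goes_left(&self, row: &[Option<f64>]) -> bool {
        for (k, &f) in self.features.iter().enumerate() {
            if row[f].is_none() {
                return self.missing_left[k];
            }
        }
        let proj = self.weights[0] * row[self.features[0]].unwrap()
            + self.weights[1] * row[self.features[1]].unwrap();
        proj <= self.threshold
    }
}
```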
oblivious_levels¶
This is used for oblivious trees.
Instead of storing arbitrary node connectivity, the IR stores:
- one split per level
- leaf indexing rules
- leaf payloads
- per-level and per-leaf statistics
This matches the semantics of oblivious trees directly:
- every node at the same depth shares the same split
- leaf selection can be described as bit accumulation
That is both more compact and more faithful than pretending an oblivious tree is just an ordinary node graph.
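As a sketch of the bit-accumulation idea (illustrative names, not the actual runtime code):

```rust
// Oblivious-tree leaf selection: one split per level, one bit per level.
// `splits` holds (feature_index, threshold) for each level, root first.
fn leaf_index(row: &[f64], splits: &[(usize, f64)]) -> usize {
    let mut idx = 0;
    for &(feature, threshold) in splits {
        let bit = (row[feature] > threshold) as usize;
        idx = (idx << 1) | bit; // accumulate one bit per level
    }
    idx // indexes into the 2^depth leaf payload array
}
```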
It also makes the runtime/lowering boundary cleaner:
- the IR expresses the native semantics of an oblivious tree
- the optimized runtime can then choose scalar, SIMD, or batch-oriented execution without first having to reverse-engineer a generic node graph back into a level-wise form
Leaf payloads¶
Leaves are represented semantically, not just numerically.
The IR distinguishes:
- regression values
- classification class indices and class values
Why both index and value?
Because the runtime often wants compact class-index-based execution, while the user-facing API still needs a stable class order and the original class labels.
The IR preserves both sides of that story.
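A hypothetical payload shape that captures that distinction (the variant and field names are assumptions, not the actual IR definitions):

```rust
// Leaves are semantic: a regression value, or a class described both ways.
enum LeafPayload {
    /// Regression leaves carry the predicted value directly.
    Regression { value: f64 },
    /// Classification leaves keep the compact class index the runtime
    /// wants and the original class label the user-facing API needs.
    Classification { class_index: u32, class_value: String },
}
```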
Input schema and preprocessing¶
ForestFire treats preprocessing as part of the model contract.
That means the IR stores:
- feature count and feature ordering
- feature logical types
- whether feature names are accepted
- numeric bin boundaries
- binary-feature handling
This matters because ForestFire models do not consume arbitrary floating-point inputs directly on the hot path. They consume the binned representation implied by training-time preprocessing.
If that preprocessing were not described explicitly, a deserialized model would not be self-sufficient.
That point is especially important now that:
- training uses adaptive power-of-two numeric binning
- optimized runtimes project to the subset of used features
- compact runtime batches store bin ids as u8 or u16
All of those runtime optimizations depend on the semantic preprocessing contract being explicit and reconstructible.
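As an illustration of why the contract must be reconstructible: a deserialized model needs enough information to map a raw value to its bin id on its own. The function below is an assumption about the shape of that mapping, not ForestFire's actual binning code:

```rust
// Map a raw numeric value to a compact bin id using sorted boundaries.
fn bin_id(value: f64, boundaries: &[f64]) -> u16 {
    // partition_point counts the boundaries <= value, which is exactly
    // the bin the value falls into; the result fits the compact u16
    // (or u8) representation the runtime batches use.
    boundaries.partition_point(|&b| b <= value) as u16
}
```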
Categorical transforms in the IR¶
Categorical models now also serialize their categorical transform contract in the IR.
That categorical block records:
- the raw input feature schema
- the categorical strategy in use
- the feature selection for categorical handling when explicitly configured
- the base smoothing parameter
- the fitted transform state needed to reproduce encoded inference
This matters because categorical strategies in ForestFire are currently implemented as explicit transforms into numeric or binary feature space.
So for a categorical model, the IR is intentionally describing two related but different spaces:
- the raw user-facing input space
- the encoded feature space consumed by the tree structure
For dummy, the IR preserves the learned indicator expansion.
For target, the IR preserves the fitted per-category target-derived mappings and priors.
For fisher, the IR preserves the learned category ordering that maps raw categories onto numeric ranks before split evaluation.
That means a deserialized categorical model can once again accept raw mixed inputs directly rather than requiring the caller to rebuild the transform out of band.
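For target specifically, the fitted state could be pictured as a per-category map plus a prior fallback. The names below (TargetEncoding, encode) are hypothetical:

```rust
use std::collections::HashMap;

// Hypothetical fitted target-encoding state as the IR would persist it.
struct TargetEncoding {
    mapping: HashMap<String, f64>, // fitted per-category encoded values
    prior: f64,                    // fallback for unseen categories
}

impl TargetEncoding {
    /// Transform one raw category into the encoded numeric feature space.
    fn encode(&self, raw: &str) -> f64 {
        *self.mapping.get(raw).unwrap_or(&self.prior)
    }
}
```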
Raw schema vs encoded feature space¶
This split is the most important categorical IR subtlety.
The tree definitions themselves still refer to encoded feature indices, because that is the feature space the learner actually trained on.
But the top-level input contract for a categorical model is still the raw input schema, because that is what callers are expected to provide at prediction time.
So the IR is effectively saying:
- these are the raw inputs the user supplies
- this is the categorical transform applied first
- these are the encoded features the trees then evaluate
That is why categorical IR support required more than just flipping a boolean flag. The transform had to become part of the semantic contract.
Missing semantics in the IR¶
Missing-value behavior is also part of the IR contract.
For ordinary axis-aligned binary nodes, that means the IR-compatible semantic tree distinguishes:
- missing goes left
- missing goes right
- missing falls back to the node prediction if no missing route was learned
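A sketch of those three behaviors, with assumed variant names (the oblique case discussed next is the multi-feature analogue):

```rust
// Illustrative missing-value routing for an axis-aligned binary node.
enum MissingRoute {
    Left,
    Right,
    /// No missing route was learned: fall back to the node prediction.
    NodePrediction,
}

/// Some(true) = go left, Some(false) = go right, None = stop here
/// and return this node's prediction.
fn route(value: Option<f64>, threshold: f64, missing: &MissingRoute) -> Option<bool> {
    match value {
        Some(v) => Some(v <= threshold), // ordinary threshold split
        None => match missing {
            MissingRoute::Left => Some(true),
            MissingRoute::Right => Some(false),
            MissingRoute::NodePrediction => None,
        },
    }
}
```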
For oblique nodes, the IR preserves the learned missing-direction metadata for each participating feature separately. That matters because the optimized runtime and deserialized semantic model both need to replay the same routing decision before evaluating the linear projection.
Output schema and postprocessing¶
The IR also explains how to interpret outputs.
That includes:
- raw outputs
- final outputs
- class ordering when applicable
- postprocessing steps
This is especially important for classification and boosting, where the raw runtime quantity and the final user-facing quantity are not always the same conceptual object.
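For example, in binary boosting the raw runtime quantity is an additive margin while the final quantity is a probability. A generic sketch of such a postprocessing step, not ForestFire's exact chain:

```rust
// Sum the per-tree raw outputs into a margin, then squash it into a
// probability. base_score and learning_rate are generic boosting terms.
fn predict_proba(raw_tree_outputs: &[f64], learning_rate: f64, base_score: f64) -> f64 {
    let margin = base_score + learning_rate * raw_tree_outputs.iter().sum::<f64>();
    1.0 / (1.0 + (-margin).exp()) // sigmoid: margin -> probability
}
```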
Training metadata¶
The IR keeps training metadata alongside model structure.
That includes:
- algorithm
- task
- tree type
- criterion
- canaries
- tree count
- max depth
- min-sample controls
- max_features
- seed
- OOB settings and score
- boosting parameters such as learning rate and gradient sampling fractions
This metadata is not needed to score the model, but it is essential for:
- reproducibility
- inspection
- debugging
- binding-level property reflection
That is why model objects can expose those parameters as properties without having to reach back into the original training call.
IR vs optimized runtime¶
The IR is not the same thing as the optimized runtime.
That distinction is important.
The IR is designed for:
- semantic clarity
- portability
- serialization
- introspection
The optimized runtime is designed for:
- compact execution layouts
- batched traversal
- SIMD-friendly access patterns
The runtime is lowered from the semantic model and IR-compatible structure. It is not the canonical serialized form.
In practice, that means:
- the IR contains the full semantic feature space
- the optimized runtime may use a projected feature space internally
- the IR contains semantic tree ordering
- the optimized runtime may reorder ensemble members for locality
- the IR contains semantic node/leaf meaning
- the optimized runtime may use fallthrough layouts, lookup tables, and compact batch representations
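As an illustration of the feature-projection point above, a runtime might keep only the semantic feature indices its trees actually read. The names here are assumptions, not the actual runtime types:

```rust
// Lowered projection from the full semantic feature space to the
// compact subset the optimized runtime consumes.
struct FeatureProjection {
    used: Vec<usize>, // semantic feature index for each runtime slot
}

impl FeatureProjection {
    fn project(&self, full_row: &[f64]) -> Vec<f64> {
        self.used.iter().map(|&i| full_row[i]).collect()
    }
}
```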
That separation keeps the project flexible:
- runtime layouts can improve without breaking serialized artifacts
- introspection can stay stable even if the runtime gets more specialized
For categorical models there is one extra distinction:
- the IR describes the raw categorical input contract plus the transform into encoded feature space
- the optimized runtime still executes on the encoded feature space after that transform has been applied
IR vs compiled optimized artifacts¶
The compiled optimized artifact is a separate layer on top of the IR.
It exists because optimized lowering itself is real work. A compiled artifact can cache:
- the semantic IR
- the lowered runtime layout
- optimized-runtime metadata such as feature projection
Why not make that compiled artifact the main model format?
- it is backend-oriented rather than semantics-oriented
- it is harder to diff, inspect, and validate manually
- runtime layouts are more likely to evolve than semantic model meaning
So the project keeps a clean separation:
- IR for semantic truth
- compiled artifacts for faster reload of one particular optimized runtime
Schema generation¶
The checked-in schema lives at:
crates/core/schema/forestfire-ir.schema.json
It is generated from the Rust IR definitions using schemars.
That gives the project two useful guarantees:
- the schema stays aligned with the actual code
- changes to the IR surface are visible and testable
The schema test exists precisely so accidental IR drift does not go unnoticed.
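A minimal example of that schemars pattern, with a stand-in struct rather than the real ModelPackageIr definition:

```rust
use schemars::{schema_for, JsonSchema};

// Stand-in for the real IR root type defined in crates/core.
#[derive(JsonSchema)]
struct ModelPackageIr {
    format_version: u32,
}

fn main() {
    // schema_for! expands the derived schema; a regression test can
    // compare this output against the checked-in schema file.
    let schema = schema_for!(ModelPackageIr);
    println!("{}", serde_json::to_string_pretty(&schema).unwrap());
}
```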
That matters more as the runtime grows more sophisticated. The more execution-side optimization ForestFire adds, the more important it becomes that the semantic layer stay explicit and regression-tested.
Why the IR is a design feature, not just a file format¶
The most important point is that the IR is not an afterthought.
It is what lets ForestFire keep these promises at the same time:
- train through one API
- inspect what was learned
- serialize the result
- lower it for faster inference
- preserve the same semantics across Rust and Python
Without the IR, those concerns would tend to fragment into separate incompatible representations.
With the IR, they stay aligned.