Intermediate Representation¶
ForestFire has an explicit model IR instead of treating serialization as a direct dump of internal Rust structs.
That decision is central to the project.
The IR is the stable semantic layer between:
- training-time model structures
- optimized inference runtimes
- Python and Rust bindings
- serialized artifacts on disk
The important word there is semantic.
The IR is not trying to be the fastest possible execution format. It is trying to be the clearest possible statement of:
- what inputs the model expects
- what each split means
- what each leaf payload means
- how outputs should be interpreted
- which preprocessing assumptions must be preserved for inference to remain correct
Why the IR exists¶
Many libraries effectively serialize whatever their internal training structs happen to look like.
That is easy at first, but it creates problems quickly:
- trainer internals leak into the exported format
- optimized runtimes become hard to reconstruct safely
- introspection and export drift apart
- changing internal layouts becomes risky
ForestFire avoids that by making the IR explicit.
The IR describes what a model means, not how a particular Rust struct happens to store it in memory.
That is why:
- Model / OptimizedModel JSON serialization
- tree introspection
- dataframe export
can all agree with one another even though they do not store or consume the model in the same way internally.
What the IR contains¶
The top-level package is ModelPackageIr.
It includes:
- producer metadata
- model structure
- input schema
- output schema
- inference options
- preprocessing description
- postprocessing description
- training metadata
- integrity / compatibility metadata
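As a rough sketch, the package can be pictured as one struct with one field per concern. The field names below are illustrative stand-ins, not the actual definitions in crates/core:

```rust
// Hypothetical shape of the top-level IR package; field names are
// illustrative. Sub-structures are collapsed to serde_json::Value
// to keep the sketch self-contained.
use serde::{Deserialize, Serialize};
use serde_json::Value;

#[derive(Serialize, Deserialize)]
struct ModelPackageIr {
    producer: Value,          // producer metadata (library, version)
    model: Value,             // tree structure: node_tree / oblivious_levels
    input_schema: Value,      // feature count, ordering, logical types
    output_schema: Value,     // raw vs final outputs, class ordering
    inference_options: Value, // semantic inference settings
    preprocessing: Value,     // numeric bin boundaries, binary handling
    postprocessing: Value,    // steps from raw outputs to final outputs
    training: Value,          // algorithm, seed, hyperparameters
    integrity: Value,         // integrity / compatibility metadata
}
```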
In practical terms, the IR answers questions like:
- what algorithm family produced this model?
- how many trees are there?
- how are the features binned?
- what do leaf values mean?
- how should outputs be interpreted?
- what assumptions does an inference runtime need to preserve?
It also answers a subtler but equally important question:
- what parts of the system are allowed to change without changing model meaning?
For ForestFire, that includes:
- optimized runtime layout
- feature projection used by optimized inference
- ensemble runtime ordering
- compiled artifact structure
None of those are part of the canonical IR because none of them change the semantic function the model computes.
Why this matters for users¶
The IR is not just an implementation detail.
It is what makes these features line up with one another:
- to_ir() / JSON serialization
- optimize_inference()
- tree_structure(...)
- tree_node(...), tree_level(...), tree_leaf(...)
- to_dataframe()
Without a stable semantic representation, those would each need their own partially overlapping view of the model.
With the IR, they all derive from the same source of truth.
This is also why optimized and non-optimized models export the same IR. If they exported different semantic artifacts, optimization would become a semantic transformation instead of an execution transformation, which is exactly what the design is trying to avoid.
Tree representations in the IR¶
ForestFire exports trees in one of two structural forms:
- node_tree
- oblivious_levels
node_tree¶
This is used for standard trees such as:
- id3
- c45
- cart
- randomized
The IR stores:
- explicit nodes
- node ids
- split metadata
- child references
- leaf payloads
- node statistics
This representation is natural for trees whose structure is irregular and branch-specific.
That includes:
- ordinary CART-like binary trees
- randomized trees, which are structurally like CART but differ in how candidate splits are chosen
- id3 and c45, whose learned structure may include multiway branches
For binary node trees, the IR can now also represent oblique splits in addition to ordinary axis-aligned threshold splits.
Those oblique splits are currently sparse and pairwise:
w1 * x_i + w2 * x_j <= t
So the IR stores not only the participating feature indices and weights, but also the learned missing-direction metadata for those two features.
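A minimal sketch of evaluating such a split, with illustrative names (ObliqueSplit, goes_left) rather than the actual runtime types:

```rust
// Hypothetical sparse pairwise oblique split: w1 * x_i + w2 * x_j <= t.
struct ObliqueSplit {
    features: [usize; 2],    // participating feature indices i, j
    weights: [f64; 2],       // w1, w2
    threshold: f64,          // t
    missing_left: [bool; 2], // learned missing direction per feature
}

impl ObliqueSplit {
    /// Returns true if the row routes left. Missing routing is replayed
    /// per participating feature before the projection is evaluated;
    /// this sketch lets the first missing feature decide.
    fn goes_left(&self, row: &[Option<f64>]) -> bool {
        for (k, &f) in self.features.iter().enumerate() {
            if row[f].is_none() {
                return self.missing_left[k];
            }
        }
        let proj = self.weights[0] * row[self.features[0]].unwrap()
            + self.weights[1] * row[self.features[1]].unwrap();
        proj <= self.threshold
    }
}
```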
oblivious_levels¶
This is used for oblivious trees.
Instead of storing arbitrary node connectivity, the IR stores:
- one split per level
- leaf indexing rules
- leaf payloads
- per-level and per-leaf statistics
This matches the semantics of oblivious trees directly:
- every node at the same depth shares the same split
- leaf selection can be described as bit accumulation
That is both more compact and more faithful than pretending an oblivious tree is just an ordinary node graph.
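As a sketch of the bit-accumulation idea (illustrative names, not the actual runtime code):

```rust
// Oblivious-tree leaf selection: one split per level, one bit per level.
// `splits` holds (feature_index, threshold) for each level, root first.
fn leaf_index(row: &[f64], splits: &[(usize, f64)]) -> usize {
    let mut idx = 0;
    for &(feature, threshold) in splits {
        let bit = (row[feature] > threshold) as usize;
        idx = (idx << 1) | bit; // accumulate one bit per level
    }
    idx // indexes into the 2^depth leaf payload array
}
```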
It also makes the runtime/lowering boundary cleaner:
- the IR expresses the native semantics of an oblivious tree
- the optimized runtime can then choose scalar, SIMD, or batch-oriented execution without first having to reverse-engineer a generic node graph back into a level-wise form
Leaf payloads¶
Leaves are represented semantically, not just numerically.
The IR distinguishes:
- regression values
- classification class indices and class values
Why both index and value?
Because the runtime often wants compact class-index-based execution, while the user-facing API still needs a stable class order and the original class labels.
The IR preserves both sides of that story.
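A hypothetical payload shape that captures that distinction (the variant and field names are assumptions, not the actual IR definitions):

```rust
// Leaves are semantic: a regression value, or a class described both ways.
enum LeafPayload {
    /// Regression leaves carry the predicted value directly.
    Regression { value: f64 },
    /// Classification leaves keep the compact class index the runtime
    /// wants and the original class label the user-facing API needs.
    Classification { class_index: u32, class_value: String },
}
```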
Input schema and preprocessing¶
ForestFire treats preprocessing as part of the model contract.
That means the IR stores:
- feature count and feature ordering
- feature logical types
- whether feature names are accepted
- numeric bin boundaries
- binary-feature handling
This matters because ForestFire models do not consume arbitrary floating-point inputs directly on the hot path. They consume the binned representation implied by training-time preprocessing.
If that preprocessing were not described explicitly, a deserialized model would not be self-sufficient.
That point is especially important now that:
- training uses adaptive power-of-two numeric binning
- optimized runtimes project to the subset of used features
- compact runtime batches store bin ids as u8 or u16
All of those runtime optimizations depend on the semantic preprocessing contract being explicit and reconstructible.
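As an illustration of why the contract must be reconstructible: a deserialized model needs enough information to map a raw value to its bin id on its own. The function below is an assumption about the shape of that mapping, not ForestFire's actual binning code:

```rust
// Map a raw numeric value to a compact bin id using sorted boundaries.
fn bin_id(value: f64, boundaries: &[f64]) -> u16 {
    // partition_point counts the boundaries <= value, which is exactly
    // the bin the value falls into; the result fits the compact u16
    // (or u8) representation the runtime batches use.
    boundaries.partition_point(|&b| b <= value) as u16
}
```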
Categorical transforms in the IR¶
Categorical models now also serialize their categorical transform contract in the IR.
That categorical block records:
- the raw input feature schema
- the categorical strategy in use
- the feature selection for categorical handling when explicitly configured
- the base smoothing parameter
- the fitted transform state needed to reproduce encoded inference
This matters because categorical strategies in ForestFire are currently implemented as explicit transforms into numeric or binary feature space.
So for a categorical model, the IR is intentionally describing two related but different spaces:
- the raw user-facing input space
- the encoded feature space consumed by the tree structure
For dummy, the IR preserves the learned indicator expansion.
For target, the IR preserves the fitted per-category target-derived mappings and priors.
For fisher, the IR preserves the learned category ordering that maps raw categories onto numeric ranks before split evaluation.
That means a deserialized categorical model can once again accept raw mixed inputs directly rather than requiring the caller to rebuild the transform out of band.
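For target specifically, the fitted state could be pictured as a per-category map plus a prior fallback. The names below (TargetEncoding, encode) are hypothetical:

```rust
use std::collections::HashMap;

// Hypothetical fitted target-encoding state as the IR would persist it.
struct TargetEncoding {
    mapping: HashMap<String, f64>, // fitted per-category encoded values
    prior: f64,                    // fallback for unseen categories
}

impl TargetEncoding {
    /// Transform one raw category into the encoded numeric feature space.
    fn encode(&self, raw: &str) -> f64 {
        *self.mapping.get(raw).unwrap_or(&self.prior)
    }
}
```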
Raw schema vs encoded feature space¶
This split is the most important categorical IR subtlety.
The tree definitions themselves still refer to encoded feature indices, because that is the feature space the learner actually trained on.
But the top-level input contract for a categorical model is still the raw input schema, because that is what callers are expected to provide at prediction time.
So the IR is effectively saying:
- these are the raw inputs the user supplies
- this is the categorical transform applied first
- these are the encoded features the trees then evaluate
That is why categorical IR support required more than just flipping a boolean flag. The transform had to become part of the semantic contract.
Missing semantics in the IR¶
Missing-value behavior is also part of the IR contract.
For ordinary axis-aligned binary nodes, that means the IR-compatible semantic tree distinguishes:
- missing goes left
- missing goes right
- missing falls back to the node prediction if no missing route was learned
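A sketch of those three behaviors, with assumed variant names (the oblique case discussed next is the multi-feature analogue):

```rust
// Illustrative missing-value routing for an axis-aligned binary node.
enum MissingRoute {
    Left,
    Right,
    /// No missing route was learned: fall back to the node prediction.
    NodePrediction,
}

/// Some(true) = go left, Some(false) = go right, None = stop here
/// and return this node's prediction.
fn route(value: Option<f64>, threshold: f64, missing: &MissingRoute) -> Option<bool> {
    match value {
        Some(v) => Some(v <= threshold), // ordinary threshold split
        None => match missing {
            MissingRoute::Left => Some(true),
            MissingRoute::Right => Some(false),
            MissingRoute::NodePrediction => None,
        },
    }
}
```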
For oblique nodes, the IR preserves the learned missing-direction metadata for each participating feature separately. That matters because the optimized runtime and deserialized semantic model both need to replay the same routing decision before evaluating the linear projection.
Output schema and postprocessing¶
The IR also explains how to interpret outputs.
That includes:
- raw outputs
- final outputs
- class ordering when applicable
- postprocessing steps
This is especially important for classification and boosting, where the raw runtime quantity and the final user-facing quantity are not always the same conceptual object.
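For example, in binary boosting the raw runtime quantity is an additive margin while the final quantity is a probability. A generic sketch of such a postprocessing step, not ForestFire's exact chain:

```rust
// Sum the per-tree raw outputs into a margin, then squash it into a
// probability. base_score and learning_rate are generic boosting terms.
fn predict_proba(raw_tree_outputs: &[f64], learning_rate: f64, base_score: f64) -> f64 {
    let margin = base_score + learning_rate * raw_tree_outputs.iter().sum::<f64>();
    1.0 / (1.0 + (-margin).exp()) // sigmoid: margin -> probability
}
```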
Training metadata¶
The IR keeps training metadata alongside model structure.
That includes:
- algorithm
- task
- tree type
- criterion
- canaries
- tree count
- max depth
- min-sample controls
- max_features
- seed
- OOB settings and score
- boosting parameters such as learning rate and gradient sampling fractions
This metadata is not needed to score the model, but it is essential for:
- reproducibility
- inspection
- debugging
- binding-level property reflection
That is why model objects can expose those parameters as properties without having to reach back into the original training call.
IR vs optimized runtime¶
The IR is not the same thing as the optimized runtime.
That distinction is important.
The IR is designed for:
- semantic clarity
- portability
- serialization
- introspection
The optimized runtime is designed for:
- compact execution layouts
- batched traversal
- SIMD-friendly access patterns
The runtime is lowered from the semantic model and IR-compatible structure. It is not the canonical serialized form.
In practice, that means:
- the IR contains the full semantic feature space
- the optimized runtime may use a projected feature space internally
- the IR contains semantic tree ordering
- the optimized runtime may reorder ensemble members for locality
- the IR contains semantic node/leaf meaning
- the optimized runtime may use fallthrough layouts, lookup tables, and compact batch representations
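As an illustration of the feature-projection point above, a runtime might keep only the semantic feature indices its trees actually read. The names here are assumptions, not the actual runtime types:

```rust
// Lowered projection from the full semantic feature space to the
// compact subset the optimized runtime consumes.
struct FeatureProjection {
    used: Vec<usize>, // semantic feature index for each runtime slot
}

impl FeatureProjection {
    fn project(&self, full_row: &[f64]) -> Vec<f64> {
        self.used.iter().map(|&i| full_row[i]).collect()
    }
}
```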
That separation keeps the project flexible:
- runtime layouts can improve without breaking serialized artifacts
- introspection can stay stable even if the runtime gets more specialized
For categorical models there is one extra distinction:
- the IR describes the raw categorical input contract plus the transform into encoded feature space
- the optimized runtime still executes on the encoded feature space after that transform has been applied
IR vs compiled optimized artifacts¶
The compiled optimized artifact is a separate layer on top of the IR.
It exists because optimized lowering itself is real work. A compiled artifact can cache:
- the semantic IR
- the lowered runtime layout
- optimized-runtime metadata such as feature projection
Why not make that compiled artifact the main model format?
- it is backend-oriented rather than semantics-oriented
- it is harder to diff, inspect, and validate manually
- runtime layouts are more likely to evolve than semantic model meaning
So the project keeps a clean separation:
- IR for semantic truth
- compiled artifacts for faster reload of one particular optimized runtime
Schema generation¶
The checked-in schema lives at:
crates/core/schema/forestfire-ir.schema.json
It is generated from the Rust IR definitions using schemars.
That gives the project two useful guarantees:
- the schema stays aligned with the actual code
- changes to the IR surface are visible and testable
The schema test exists precisely so accidental IR drift does not go unnoticed.
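A minimal example of that schemars pattern, with a stand-in struct rather than the real ModelPackageIr definition:

```rust
use schemars::{schema_for, JsonSchema};

// Stand-in for the real IR root type defined in crates/core.
#[derive(JsonSchema)]
struct ModelPackageIr {
    format_version: u32,
}

fn main() {
    // schema_for! expands the derived schema; a regression test can
    // compare this output against the checked-in schema file.
    let schema = schema_for!(ModelPackageIr);
    println!("{}", serde_json::to_string_pretty(&schema).unwrap());
}
```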
That matters more as the runtime grows more sophisticated. The more execution-side optimization ForestFire adds, the more important it becomes that the semantic layer stay explicit and regression-tested.
Why the IR is a design feature, not just a file format¶
The most important point is that the IR is not an afterthought.
It is what lets ForestFire keep these promises at the same time:
- train through one API
- inspect what was learned
- serialize the result
- lower it for faster inference
- preserve the same semantics across Rust and Python
Without the IR, those concerns would tend to fragment into separate incompatible representations.
With the IR, they stay aligned.