

2025-11-10 InterpN: Fast Interpolation

InterpN is a low-latency numerical interpolation library for realtime and bare-metal embedded systems, written in Rust. It remains the first and only of its kind in this domain.

The performance discipline demanded by a bare-metal target pays off elsewhere as well: InterpN achieves excellent application-level throughput and has found use in scientific computing via the Rust crate and Python bindings.

Results

InterpN achieves up to 200x speedup over the state of the art (Scipy) for N-dimensional interpolation.

Similar speedups are obtained for large batch inputs, with more modest (about 10x) speedups for cubic methods.

These benchmarks graciously exclude the cost of initializing Scipy's methods, which allocate, perform a linear solve, and do other preparation up front.

By contrast, InterpN can be called directly on a reference to the data at any time with no initialization penalty.

Finally, InterpN produces about 8 orders of magnitude less floating-point error than Scipy.

Motivation

Interpolation is one of the most common tasks in scientific computing and realtime/embedded systems.

Between these two domains, usage falls into two broad categories:

  • Small batch / low latency (optimization, realtime, embedded)
  • Large batch / high throughput (visualization, fluid/material property lookups, grid data remapping)

InterpN targets the first category, but excels in both.

The Algorithm

The combination of low latency, high throughput, and bare-metal compatibility is achieved through a (bounded) recursive algorithm.

The concept itself is easy: do 1D interpolation on one axis at a time. After visiting each of the N dimensions, you arrive at the final interpolated value.

The trick to making it efficient is to formulate the data access as a tree of depth N instead of as an N-cube:

(Figure: algorithm diagram)

This method extends to higher dimensions by deepening the tree, and applies to linear as well as cubic methods by changing the degree of the tree and the 1D interpolation kernel.

This algorithm is vaguely related to de Casteljau's algorithm and de Boor's algorithm, both of which are also recursive. In practice, it looks nothing like either one and shares no implementation details.

The 1D cubic spline evaluation uses Horner's method to minimize the total number of FLOPs and roundoff events.
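
As a concrete sketch (names are illustrative, not InterpN's internal API), the 1D cubic kernel in Horner form needs only three multiplies and three adds per evaluation:

#[inline]
fn cubic_horner(t: f64, y0: f64, c1: f64, c2: f64, c3: f64) -> f64 {
    // y(t) = y0 + c1*t + c2*t^2 + c3*t^3, evaluated inside-out
    y0 + t * (c1 + t * (c2 + t * c3))
}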

In an earlier post, I tabled the recursive algorithm to avoid heap allocation. Since then, more careful thought resulted in the tree formulation.

Performance Hygiene

There are some things we can do to improve performance without touching the code.

These settings will be used when building your crate directly as a binary, Python extension library, or similar.

Optimizations, codegen-units, and LTO

First, configure the release profile in Cargo.toml.

[profile.release]
codegen-units = 1       # Optimize the whole crate together
lto = true              # Reduces binary size + small speedup
overflow-checks = true  # Free reliability!

x86 Instruction Sets

Next, configure compiler flags in .cargo/config.toml.

By default, rustc targets CPUs from 2001. For both scientific computing and realtime systems, it's fair to target newer machines.

To target about 2008 onward,

[target.'cfg(target_arch = "x86_64")']
rustflags = ["-C", "target-feature=+fma"]  # incl. SSE4 and AVX

Or, for about 2013 onward,

[target.'cfg(target_arch = "x86_64")']
rustflags = ["-C", "target-feature=+avx2,+fma"]

In a few years, when AVX512 (released 2016) has become more common, this list might look like

[target.'cfg(target_arch = "x86_64")']
rustflags = ["-C", "target-feature=+avx2,+fma,+avx512f"]
although, as of 2025, AVX512 can cause excessive latency and heating for some use-cases.

These settings allow rustc to produce modern vector (SIMD) instructions for x86 processors. The rustc documentation contains the full list of available x86 instruction sets.

Next, we'll look at how to write code to use these for massive speedups.

Vectorization Two Ways

Now that we've got vector instructions, let's use them.

Primer

As a brief primer on vectorization: the term refers to the use of SIMD (single instruction, multiple data) aka vector instructions to perform multiple float operations at once, independent of threading. Vectorization is associated with the benefit of reading from sequential and/or cache-line-aligned memory addresses for efficient reads from RAM, although strictly speaking, that is a distinct benefit of the patterns used to produce vector instructions. The combined effect of both is typically a 5-10x speedup.

Nick Wilcox's auto-vectorizer tutorial is an excellent recipe for what I'll call "Type 1" vectorization.

  • Type 1 (numpy, ndarray etc): simple loops performing one operation on each datum
  • Type 2 (InterpN): complex loops performing as many operations as possible on each datum

Both methods rely on the autovectorizer and require careful elimination of bounds checks.
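
For illustration, here is one way to hoist bounds checks that is common to both patterns (a sketch, not code from either library): re-slicing every input to a common length up front lets the compiler prove that each index in the loop is in range, so no per-element checks remain to block vectorization.

// Re-slice to a common length so the compiler can elide bounds checks.
fn scaled_sum(b: &[f64], c: &[f64], out: &mut [f64]) {
    let n = b.len().min(c.len()).min(out.len());
    let (b, c, out) = (&b[..n], &c[..n], &mut out[..n]);
    for i in 0..n {
        out[i] = 2.0 * b[i] + c[i]; // No bounds checks emitted inside the loop
    }
}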

Eliminating Allocation

For example, take the expression a = b * c^2; d = b^2 + c^3; e = a + d;.

In a Type 1 vectorized approach,

// With ndarrays b, c
let a = b * c.powi(2);          // Allocates 2 times
let d = b.powi(2) + c.powi(3);  // Allocates 3 times
let e = a + d;                  // Allocates again

Each individual operation is well-vectorized and readable, but

  • ⚠️ This produces slow traffic back-and-forth to RAM
  • ⚠️ Extra allocations increase latency and total RAM usage
  • ⚠️ The compiler and CPU can only optimize one operation at a time
    • c^2 is not reused between the calculation of a and d
    • b^2 won't be pipelined alongside c^2
    • ...and so on. Even for such a small expression, there are many missed optimizations!
  • ⚠️ Bounds checks are repeated for each operation

A Type 2 approach would look like this:

// With slices b, c
assert!(b.len() == c.len());     // Pull bounds check forward
let mut e = vec![0.0; b.len()];  // Allocate once

// Unidiomatic, but we checked bounds, so it's OK to index.
for i in 0..e.len() {
  let (bi, ci) = (b[i], c[i]);
  let a = bi * ci * ci;
  let d = bi * bi + ci * ci * ci;
  e[i] = a + d;
}

and produces higher overall throughput by resolving all those issues.

  • ✅ Minimum possible RAM accesses and heap allocations
  • ✅ Compiler automatically reuses duplicate values
  • ✅ CPU can execute independent operations concurrently without threading
  • ✅ Bounds checks are only performed once
  • ✅ Compiler generates vector instructions
    • Both spanning sequential indices and within the scalar expression

Finally, to produce a Type 2 pattern, we don't actually write down the entire scalar expression in a loop like that example. Instead, we can simply rely on inlining to do it for us.

// ...
#[inline]
fn my_expr(b: f64, c: f64) -> f64 {
  let a = b * c * c;
  let d = b * b + c * c * c;
  a + d
}

// ...
for i in 0..e.len() {
  e[i] = my_expr(b[i], c[i]);
}

Careful Branching & Cache-Locality

InterpN selects between 7 distinct cases for cubic interpolation:

out                     out
---|---|---|---|---|---|--- Grid
 2   2   3   3   3   2   2  Order of interpolant between grid points
 1                       1  Extrapolation with linearize_extrapolation

This is enabled by a conspicuous benefit of the Type 2 pattern: robustness to branching in the loop.

In a Type 1 pattern, branches in the loop disrupt autovectorization, cause cache misses, and obliterate performance. This drives implementations like Scipy's pchip to pre-calculate knot coefficients during initialization.

Meanwhile, branching (carefully) in a Type 2 pattern is fine as long as all the data and procedures required by both branches are present in cache, and it eliminates the need to allocate and pre-calculate coefficients during initialization.
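
As a sketch of what that can look like (hypothetical names, not the library's internals), a per-point branch where both arms read the same small set of coefficients keeps all the required data hot in cache:

// Branching inside a Type 2 loop: both arms use the same per-point coefficients.
fn eval_segments(t: &[f64], y0: &[f64], c: &[[f64; 3]], linearize: &[bool], out: &mut [f64]) {
    assert!(t.len() == out.len() && y0.len() == out.len());
    assert!(c.len() == out.len() && linearize.len() == out.len());
    for i in 0..out.len() {
        let [c1, c2, c3] = c[i];
        out[i] = if linearize[i] {
            y0[i] + t[i] * c1 // Linearized extrapolation
        } else {
            y0[i] + t[i] * (c1 + t[i] * (c2 + t[i] * c3)) // Full cubic segment
        };
    }
}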

Fused Multiply-Add

FMA allows combining a * x + b into a single operation a.mul_add(x, b) with a single roundoff event, increasing throughput and decreasing floating-point error.

When to Use FMA

FMA is not available as a hardware instruction on all CPUs. Sometimes, even when it is available (as with the Cortex M7 microcontroller), the increased latency may be unacceptable.

As a rule of thumb, FMA is usually beneficial for desktop applications, but usually detrimental for microcontrollers.

How to Use FMA

To let the user of a library choose whether FMA is right for their application, add a feature flag in Cargo.toml that isn't associated with any dependencies:

[features]
fma = []

Then, at each location that could benefit from hardware FMA, use it behind a feature flag:

// Horner's method
#[cfg(not(feature = "fma"))]
let y = y0 + t * (c1 + t * (c2 + t * c3));

#[cfg(feature = "fma")]
let y = c3.mul_add(t, c2).mul_add(t, c1).mul_add(t, y0);

A Brief Aside about libm

Some math functions in Rust defer to system math libraries and produce inconsistent behavior between platforms. For functions like sqrt, log, and other transcendentals, make sure to use libm for a pure-Rust implementation that is much more consistent and reliable.
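
For example, a sketch with a hypothetical std feature flag (not InterpN's exact setup):

// Route sqrt through libm in no-std builds for consistent results across platforms.
#[cfg(feature = "std")]
fn hypot2(a: f64, b: f64) -> f64 {
    (a * a + b * b).sqrt()
}

#[cfg(not(feature = "std"))]
fn hypot2(a: f64, b: f64) -> f64 {
    libm::sqrt(a * a + b * b)
}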

Traversing Unrealized Trees

Recall the tree that we're traversing:

(Figure: tree diagram)

The key to eliminating allocation in this tree formulation is to recognize that we never need to actually build the tree.

Two distinct evaluation patterns are used by InterpN to bypass fully realizing the tree: bounded recursion for large numbers of dimensions, and const-unrolling otherwise.

Bounded Recursion

We can traverse the tree by bounded recursion, using stack frames to automatically expand the storage required to visit each node.

This allows us to realize the minimum storage to represent a path through the tree in N dimensions with a footprint of size F (O(F*N)), which is a massive improvement over realizing the entire cube (O(F^N)).
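
A bare-bones sketch of the idea for the multilinear case (illustrative code, not InterpN's implementation): each stack frame handles one axis, and the flat index of a corner is accumulated along the path from root to leaf.

// Recursive multilinear evaluation over the 2^N corners of one grid cell.
// `strides[d]` is the flat-index stride of dimension d, `base` is the flat index
// of the cell's lower corner, and `t[d]` is the normalized coordinate on axis d.
fn eval_cell(vals: &[f64], strides: &[usize], t: &[f64], base: usize, dim: usize) -> f64 {
    if dim == t.len() {
        return vals[base]; // Leaf: a single corner value
    }
    // Interpolate along axis `dim` between the two subtrees
    let lo = eval_cell(vals, strides, t, base, dim + 1);
    let hi = eval_cell(vals, strides, t, base + strides[dim], dim + 1);
    lo + t[dim] * (hi - lo)
}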

Pros

  • O(F*N) storage
  • ✅ Easy to relate to the interpolation method
  • ✅ Quick to compile
  • ✅ One function can support different numbers of dimensions

Cons

  • ⚠️ Slow due to use of stack frames (still faster than Scipy, but could be better)
  • ⚠️ Statically un-analyzable call stack due to recursion (not so good for embedded)
  • ⚠️ The recursive function can't be fully inlined; we could be helping the compiler more

Const-Unrolling

Alternatively, we can reformulate the recursion as a for-loop over a fixed number of dimensions N using const generics and crunchy::unroll! to flatten the entire tree into a single stack frame at compile-time.

Const-unrolling with a parametrized bound looks like this:

// With `const N: usize`
const { assert!(N < 5) };

crunchy::unroll! {
    for i < 5 in 0..N {
      // Loop expanded at compile-time
    }
}

Const-unrolling can be applied to the entire traversal of the dependency tree when evaluating an interpolation, because for a given number of dimensions and footprint size, that tree always looks exactly the same.

Pros

  • O(F*N) storage
  • ✅ Very fast; compiler can optimize over the entire calculation in a single stack frame
  • ✅ Statically-analyzable call stack
  • ✅ Const indexing (everything except float math and index offsets happens at compile-time)

Cons

  • ⚠️ Long compile times for large N
  • ⚠️ Requires compiling for an exact number of dimensions (see the dispatch sketch after this list)
  • ⚠️ Harder to relate to the interpolation method (it doesn't look like a recursion)
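
One common way to live with the exact-N requirement (a sketch, not necessarily how InterpN handles it) is to dispatch a runtime dimension count to a small set of monomorphized entry points:

// Dispatch a runtime dimension count to a const-generic implementation.
fn interp_dispatch(x: &[f64]) -> f64 {
    match x.len() {
        1 => interp_fixed::<1>(x.try_into().unwrap()),
        2 => interp_fixed::<2>(x.try_into().unwrap()),
        3 => interp_fixed::<3>(x.try_into().unwrap()),
        _ => unimplemented!("fall back to the recursive evaluator"),
    }
}

fn interp_fixed<const N: usize>(x: &[f64; N]) -> f64 {
    // In the real thing, the whole tree traversal is unrolled here at compile time
    x.iter().sum()
}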

Profile-Guided Optimization

PGO is the practice of doing a two-pass build to allow the compiler to do optimizations with some knowledge of how the program will actually be run. Speedups are typically in the 20%-100% range, but can vary far outside that range.

PGO for Rust Binaries

Jakub Beránek's blog post provides an excellent introduction to PGO for Rust binaries, and his cargo-pgo tool streamlines the process described there.

The workflow for Rust binaries is like so:

  1. Build the Rust binary with PGO instrumentation
  2. Run representative workloads using that binary to collect usage data
  3. Combine multiple raw profiles into a single combined profile
  4. Rebuild the binary using the profile to enhance compiler optimizations

PGO for Python Extension Libraries

For a Python extension library, the procedure is similar:

  1. Build the extension library with PGO instrumentation
  2. Run a Python profile workflow
  3. Combine raw profiles
  4. Store the profile
  5. Use the stored profile to optimize the Python extension when building wheels to deploy to PyPI

For InterpN, steps 1-4 look like so:

# Build instrumented wheel
cargo clean
uv cache clean
uv pip install maturin
rm -rf dist/
UV_NO_BUILD_CACHE=1 uv run --no-sync maturin build --compatibility pypi --out dist --verbose -- "-Cprofile-generate=${PWD}/scripts/pgo-profiles/pgo.profraw"

# Install instrumented wheel
uv pip install $(find dist/ -name '*.whl')[pydantic] --group test --group bench --reinstall

# Run reference workload to generate profile
uv run --no-sync ./scripts/profile_workload.py

# Merge profiles
/usr/lib/llvm-21/bin/llvm-profdata merge -o scripts/pgo-profiles/pgo.profdata $(find scripts/pgo-profiles/pgo.profraw -name '*.profraw')

Step 5 is implemented as a minor edit to the maturin-action workflow's args input to set the path to the profile data:

- name: Build wheels
  uses: PyO3/maturin-action@v1
  with:
    target: ${{ matrix.platform.target }}
    args: --release --out dist --find-interpreter --verbose -- "-Cprofile-use=${{github.workspace}}/scripts/pgo-profiles/pgo.profdata" "-Cllvm-args=-pgo-warn-missing-function"
    manylinux: auto

PGO's Sharp Edges

  • LLVM version must match between
    • rustc used locally to generate the profiles
    • LLVM used to merge profiles
    • and rustc used during deployment
  • PGO profiles are not portable between platforms
    • They won't cause a crash during build, but will silently have no effect
  • PGO is very sensitive to choice of profile workload
    • Don't profile against unit tests or benchmarks! This will likely make the program slower under real use
  • PGO profiles must be passed in as an absolute path; relative paths are not expanded by rustc
  • No configuration in Cargo.toml or .cargo/config.toml can result in PGO being used automatically

As always, go carefully, but do go.

Conclusion

no-std compatibility in scientific computing libraries may seem tangential to the task at hand, but it ultimately produces excellent performance, allows reusing the same code for embedded targets and desktop applications, and, more recently, allows reuse as a GPU kernel as well.

This Is Not a One-Off

These techniques are not specific to interpolation, and have been applied successfully to signal processing, realtime controls, analytic electromagnetics, fluid mechanics, structural mechanics, and computational geometry.

Next Steps

  • PGO for platforms other than linux
  • Thread parallelism with rayon
    • This is incredibly easy, but just hasn't been necessary yet
    • We'll use another core when we're done with the first one
  • Cross-platform GPU kernel extension using rust-gpu and wgpu
    • Because it is no-std compatible, InterpN is already GPU-ready! Just needs some plumbing

2023-11-24: Building a Hypercube Interpolator


Edited on 2023-11-30 to add more implementation details.

Results

The TL;DR is that we can achieve a 10-150x speedup over the reference (FITPACK via scipy) when interpolating at a small number of observation points while eliminating allocation (!) and staying close to perf parity for larger inputs.

This is particularly relevant to engineering optimization, where differentiation workloads for interpolators typically rely on forward differences that make many function calls with a small number of inputs.

Similarly, in electrical, fluid, or thermal simulation, sim times are often dominated by the overhead of making many calls for a small number of outputs from real-material property data tables through an interface such as CoolProp, and this is compounded for optimization through such simulations.

Finally, some perf charts for a reference case on fair ground: interpolation at (pre-selected) random locations inside a 20x20x20 3D grid.

Without preallocation of outputs for the methods described here: (Figure: no_prealloc)

Unlike the scipy bindings, the methods presented here can evaluate into preallocated output storage.

With preallocation of outputs for the methods described here: (Figure: with_prealloc)

Hold on - did they get worse with preallocation? No! This is an artifact of the benchmarking process, but one that accentuates the value of eliminating allocation. The benchmarks are interleaved (method 1 x100, method 2 x100, method 3 x100, ...), so they see some performance cross-talk through allocation.

Eliminating allocation from the interpn methods significantly improves scipy's performance in neighboring benchmarks, despite both methods using an un-instrumented warmup and manually calling python's garbage collector between benchmarks. The scipy method is still allocating, it just has less overhead - so improving interpn's memory performance by evaluating into a preallocated array helps scipy even more than it helps interpn. Quite an on-the-nose example of the value of avoiding allocation in application-level code!

Rationale

First, a discussion of the state of the art, and why one might bother digging into a field that's surely been a solved problem for some time.

FITPACK exists

As far as I'm aware, the last and only meaningful work done on general and reusable interpolation methods for scientific computing was Dierckx's excellent 1994 textbook and the accompanying Fortran library, FITPACK.

The Dierckx FITPACK library has since been absorbed into netlib alongside BLAS and LAPACK, and remains under a homebrewed license.

As such, it is usually integrated into other ecosystems as a user-side dependency, often in addition to all of the dependencies required to compile Fortran and produce bindings in the target language.

Notably, the scipy project includes bindings to many FITPACK routines. In Rust, the splinify library provides bindings to select splining routines, although I have not been able to get through the game of dynamic dependency whack-a-mole to get it to build successfully.

FITPACK supports many pieces of functionality that are out of scope here - in particular, Dierckx gives a thorough and performant treatment to fitting the interpolator coefficients, smoothing, and surface identification, often by finding fairly inspired phrasings of fitting problems as quadratic programs regardless of the polynomial order of the curvefits.

This is not FITPACK

The goal of this project is not to replicate all of FITPACK's functionality. In fact, curve fitting in its entirety is out of scope. Instead, this project will focus on the interpolation tasks common in scientific computing, which (possibly after some uncertainty-weighted fitting) typically hold a hard requirement on the interpolator reproducing the anchor values exactly.

Unlike fitting and interpolation for graphics, the correctness of the result is the highest priority, followed by performance and ergonomics in close succession. Smoothing or other manipulation of the data are anti-goals, as they directly compromise correctness.

This is something more specific

Good quality scientific methods for curvefitting are a larger topic than interpolation, and while there are surely patterns in the process, those methods also tend to be fairly bespoke to the problem under examination.

Functionality goals

  • Evaluation of values (and later, gradient and hessian entries) for exact interpolators in Rust
  • Self-contained (evaluate using only grid locations and data values)
  • Avoid the need for another program or library to provide spline knots, etc
  • Clear and permissive licensing
  • Convenient, structured, and version-controlled build process
  • Convenient inter-language interop & first-class support for bindings
  • Compatibility with embedded and high-performance systems (no allocation)
  • Smooth transition from interpolation to extrapolation (to support solvers)
  • Separation of bounds error checking from interpolation/extrapolation (to support solvers)

Tooling

            Rust                                 Python
Repo        interpn                              interpnpy
Docs        docs.rs                              readthedocs w/ mkdocs
Benchmark   Criterion                            timeit w/ warmup
Lint        rustfmt & clippy                     ruff & pyright
Test        cargo test                           pytest & mktestdocs
Release     release-plz & cargo-semver-checks    maturin & PyO3

Implementation: selecting a method

First, there are three approaches to interpolation on a regular or rectilinear grid that I've come across, and I had to choose one.

  • Recursive (like FITPACK)
  • Convolution (as used in GPU libraries)
  • Geometric (no implementations I'm aware of)

Recursive method?

The recursive method is fairly intuitive in addition to being fairly performant, and making no compromise on correctness or generality. The idea is to interpolate on one axis at a time, projecting the grid cell into lower dimensions until finally arriving at a point.

(Figure: 3d recursive method)

It's easy enough to imagine how this recursive method would generalize to extrapolation or to higher dimensions - do nothing, and it will handle those cases just fine.

However, the amount of data that needs to be actualized at each step of the recursion varies with the number of dimensions - O(2^(ndims - 1)) per observation point, plus or minus a factor of 2 depending on how hard one would like to optimize it.

Whether or not this is a lot depends on circumstance - but it'd sure be neat if we could find a way to use a constant amount of storage, or at least an amount that scales more favorably with the number of dimensions.

Convolutional method?

The convolutional method doesn't make much of a diagram, but it's exceptionally fast on GPU for image scaling, and generalizes to higher polynomials. Unfortunately, it doesn't handle points near the edge of the grid very well, nor extrapolation - in fact, its performance gains are directly the product of sacrificing both correctness and generality.

You also have to develop the footprint coefficients specifically for every dimensionality, and while I'm sure there's some series formulation of it similar to how finite difference coefficients are tabulated, I have no desire to develop that formulation.

It's excellent for resampling images, and there's a huge market for that, but it's not quite what's needed for scientific computing.

Geometric method

The geometric method, as far as I can tell, only exists as a blurb on wikipedia intended to give an intuitive explanation of the behavior of the better methods.

The idea is that each grid cell can be partitioned into sub-cells characterized by the vector from the observation point to each grid cell vertex. This partitioning fully fills the grid cell. A multilinear interpolation can be obtained as the sum of the values at each grid cell multiplied by the area of the opposing sub-cell.
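
As a concrete sketch in 2D (illustrative code, not the library's implementation): the interpolated value is the average of the four corner values weighted by the area of the sub-cell opposite each corner, for observation points on the interior of the cell.

// Bilinear interpolation via the geometric method, interior points only.
// `corner` is the cell's lower corner, `steps` the cell size on each axis, and
// `vals[i][j]` the value at the vertex offset by (i, j) grid steps from `corner`.
fn bilinear_geometric(x: [f64; 2], corner: [f64; 2], steps: [f64; 2], vals: [[f64; 2]; 2]) -> f64 {
    let cell_area = steps[0] * steps[1];
    let mut acc = 0.0;
    for i in 0..2 {
        for j in 0..2 {
            // Rectangle spanned by the observation point and the *opposite* vertex
            let dx = (corner[0] + (1 - i) as f64 * steps[0] - x[0]).abs();
            let dy = (corner[1] + (1 - j) as f64 * steps[1] - x[1]).abs();
            acc += vals[i][j] * dx * dy;
        }
    }
    acc / cell_area // Weighted average over the full cell area
}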

(Figure: 2d geometric method on interior of grid cell)

It's intended as an educational aid, but as an algorithm, it has some interesting properties that merit a harder look.

First, the memory scaling is excellent: only O(ndims) stack usage is ever needed simultaneously per observation point.

Second, it is better-positioned to avoid floating point error due to aggregating the contribution from each of the grid cell vertices simultaneously (a sum of products) rather than the recursive method's long chain of alternating subtractions, divisions, and multiplications. In fact, division is eliminated from intermediate steps entirely, and the longest chain of error-prone add or sub operations is of a constant length 2 instead of length ndims. The longest chain of multiplications is ndims, but only one division is required per observation point regardless of dimensionality, and multiplications are a relatively minor contributor to float error compared to addition and subtraction.

Third, it may actually be more correct, in a sense, than the recursive method, because under extrapolation in corner regions, it will not extrapolate the derivatives on each axis the way that the recursive method does.

Finding the grid cell

Regardless of the choice of method, we'll always need to know what grid cell the observation point lands in, in order to know which tabulated values contribute to the interpolated result.

Regular grid

For regular grids where all the grid locations are evenly spaced, this is logically easy if we're inside the grid:

// for some observation point coordinate `x`
let int_loc = ((x - grid_start) / grid_step) as usize;

It gets a little more complicated when we consider that the point may be outside the grid, in which case we need to snap to the grid:

let int_loc = (((x - grid_start) / grid_step) as usize)
              .min(grid_size - 1)
              .max(0);

But wait - we can't actually represent a value below zero as a usize, so we need a stopover as a signed integer to represent extrapolation below the grid without integer overflow:

let int_loc = (((x - grid_start) / grid_step) as isize)
              .min(grid_size - 1)
              .max(0) 
              as usize;

We can't actually run that yet, though, because we can't just as isize and as usize a generic float type.

Converting a float to an integer also isn't exactly a clear and concise operation. It can be done sketchily with a handful of bit-shift and mask operations, but producing a correct result in general, while avoiding harmful edge cases, is not always possible, and we may have to raise an error here to avoid returning an incorrect result.

Our observation point x is a generic T: Float using num_traits to generalize over f32 and f64. In order to also capture safe typecasting, we can add another trait from num_traits, NumCast.

let floc = (x - grid_start) / grid_step;              // Float location
let iloc = <isize as NumCast>::from(floc).unwrap();   // Signed integer location
let uloc = iloc.min(grid_size - 1).max(0) as usize;   // Usable unsigned index

That got a little weird, and there's a conspicuous unwrap() there: in fact, this is expected, because if we ever try to convert, say, a NaN value, or a value larger than isize::MAX, we'll have a real un-handleable error that needs to propagate to prevent the method from returning a genuinely incorrect result.

Rectilinear grid

For a rectilinear grid, we have an uneven spacing of grid cells on each axis. This means that in general, we're not sure exactly where a given observation point lands in the grid until we check it against the grid values.

If we don't know anything about the location of the observation point other than that it's more likely to be inside the grid than outside, then a bisection search is the theoretical optimum for finding the grid index.

The Rust core library implements an excellent binary search, which we can use via the slice::partition_point() interface.

let uloc = grid.partition_point(|v| *v < x);

In practice, the simplicity gets engineered out of it to handle extrapolation in a similar way to what happened with the regular grid method, but that's the core of the idea.
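
A sketch of that engineering (illustrative, not the exact library code): clamp the search result so that observation points outside the grid still resolve to the outermost cell.

// Find the lower-corner index of the grid cell containing (or nearest to) `x`.
fn rectilinear_cell_index(grid: &[f64], x: f64) -> usize {
    let loc = grid.partition_point(|v| *v < x); // First index with grid[i] >= x
    // Step back to the cell's lower corner, saturating into the grid so that
    // points below or above the grid land in the first or last cell.
    loc.saturating_sub(1).min(grid.len().saturating_sub(2))
}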

Bagging the memory scaling

The improved memory scaling of the geometric method only holds if we can visit the vertices of a given grid cell one by one, accumulating the contribution of each one before moving to the next. The value accumulated from each vertex doesn't depend on the others, so it works out in terms of the flow of information - but there are 2^ndims vertices to visit.

The usual solution for visiting all of them would be something like itertools' multi_cartesian_product - but this requires the standard library, and in order to actually access all of the indices at once, we'd still end up actualizing all 2^ndims vertices at once, breaking the O(ndims) memory scaling.

Starting with the index of the lower corner of the grid cell, we can represent each vertex in terms of an array like [bool; ndims] where each entry is the index offset from that lower corner to the current vertex.

Ultimately, we need to get one row at a time out of a table like this:

dim      2 1 0
offset   0 0 0
         0 0 1
         0 1 0
         0 1 1
         1 0 0
         1 0 1
         1 1 0
         1 1 1

If we don't mind some integer operations, we can generate this table one row at a time by keeping a running record of the index offsets, and flipping the offset of each dimension every 2^dim vertices:

// In a loop to examine the vertices one at a time...
let mut ioffs = [false; ndims];
let nverts = 2_usize.pow(ndims as u32);
for i in 0..nverts {
    // Every 2^nth vertex, flip which side of the cube we are examining
    // in the nth dimension.
    //
    // Because i % 2^n has double the period for each sequential n,
    // and their phase is only aligned once every 2^n for the largest
    // n in the set, this is guaranteed to produce a path that visits
    // each vertex exactly once.
    for j in 0..ndims {
        let flip = i % 2_usize.pow(j as u32) == 0;
        if flip {
            ioffs[j] = !ioffs[j];
        }
    }

    // ... then accumulate the contribution of the vertex
}

So with that, we have a no-std, limited-memory version of multi_cartesian_product specialized to this usage, and we only need to store a tiny number of values to run it.

The cost comes in compute, but ultimately, integer mod with a base that is a power of 2 is not expensive. In any case, this compute has to happen somewhere, and it may as well be here.

Doing this mod operation with manually-assembled bit-shift operations does not improve performance. I haven't looked too hard at the assembly, but I'd bet that's because the compiler knows what's up.

Reinventing the wheel (indexing into an array)

Ok, well, now that we have the vertex, we're still not quite ready to start accumulating the contribution. First, we need to get the grid value associated with that vertex.

Without any external dependencies on array libraries, that means writing our own array indexing.

Assuming we've got a contiguous input array that was formulated from C-style interleaved inputs from an N x M coordinate grid, the values are ordered like this:

[
  z(x0, y0), ... , z(x0, yM),
  z(x1, y0), ... , z(x1, yM), 
  ...
  z(xN, y0), ... , z(xN, yM)
]

In order to know where to find each row, column, or whatever you call the flakes of an array in the higher dimensions, we need to skip the cumulative product of the sizes of all the higher dimensions in order to find the next flake of the lower dimension.

// Sometime before the loop, accumulate the cumulative product
// of sizes of each dimension of the grid

// Populate cumulative product of higher dimensions for indexing.
//
// Each entry is the cumulative product of the size of dimensions
// higher than this one, which is the stride between blocks
// relating to a given index along each dimension.
let mut dimprod = [1_usize; ndims];
let mut acc = 1;
(0..ndims).for_each(|i| {
    dimprod[ndims - i - 1] = acc;
    acc *= self.dims[ndims - i - 1];
});  // This doesn't quite work as a .fold()

for i in 0..nverts {
    //  ... sometime later in the loop where we did some shenanigans
    //      with a mod operation to find the `ioffs` entries

    // Accumulate the index into the value array,
    // saturating to the bound if the resulting index would be outside.
    let mut k = 0;
    for j in 0..ndims {
        k += dimprod[j]
            * (origin[j] + ioffs[j] as usize).min(self.dims[j].saturating_sub(1));
    }

    // ... then, maybe, finally, accumulate the contribution of the vertex
}

That expanded a bit from concept to execution, but it's also one of the places where Rust's particulars shine: the saturating_{op} functions are really excellent for doing safe indexing, and while they may represent extra operations in some places, usually wherever there's a saturating_{op}, there's a slice indexing bounds check that can be optimized out.

And since we're examining extrapolation as well as interpolation, these indexing edge cases aren't academic - they're the product.

Doing interpolation with the geometric method is easy:

let mut interped = T::zero(); // The accumulated interpolated value
for i in 0..nverts {
    //  ... sometime later in the loop where we did some shenanigans
    //      with a mod operation to find the `ioffs` entries
    //      and after we hand-roll array indexing

    // Get the value at this vertex
    let v = vals[k];

    // Find the vector from the opposite vertex to the observation point
    let mut dxs = [T::zero(); ndims];
    for j in 0..ndims {
        // Populate dxs[j] with the (unsigned) vector
    }

    // Accumulate contribution from this vertex
    if !extrapolating {
        // Interpolating is easy! Just do a cumulative product
        // to get the sub-cell volume for this vertex which
        // tells us its weight in the weighted average
        let vol = dxs.iter().fold(T::one(), |acc, x| acc * *x);
        interped = interped + v * vol;
    }
    else {
      // Extrapolating
      // ...
    }
}

// Return the interpolated total, dividing through by the total
// cell volume just one time at the end
interped / cell_vol

Extending the geometric method to extrapolation

The geometric method sounds good on paper, but let's see how it holds up in a practical implementation. In particular, we need it to work well (1) in extrapolation and (2) in higher dimensions.

I filled up a notebook with doodles and algebra figuring this out, but I'll spare us both the excessive summations and keep the abuses of notation between me and the graph paper. Visual depictions of computational geometry fare better than obfuscatory equation soup, anyway.

Extrapolation without adjustment

Let's try extrapolating in 2D.

(Figure: 2d geometric method extrapolated in corner region)

Well, that didn't work - at least not without some extra handling.

Extrapolation by extending the grid?

One option would be to extend the grid cell to include the full area including the observation point.

On inspection, this is the same as the recursive method - we would need to treat each dimension in extrapolation one at a time, actualize the new grid, then finish the interpolation (which, in this case, would just be taking one corner of the grid as the final value).

That's not great. For one thing, it would have a worse memory scaling, and I'd be implementing both methods instead of just one. For another, it produces a dubious result when the grid data describes a nonlinear function: the extrapolation would twist the grid cell like a potato chip, effectively extrapolating the derivatives as well as the value and producing a kind of quadratic extrapolation instead of a linear one.

There's a better way, but it involves chewing a bit of glass.

Extrapolation by a pile of extra logic

I don't have a good name for this process. It's a turducken of constructive geometry and plain algebra - basically reverse-engineering the failure of the method to naturally extend to extrapolation, and eliminating those failures.

We clearly need to zero-out the contribution from vertex A as we move into extrapolation to produce the same result we would get by taking the finite difference from B to C and from D to C, and extrapolating linearly on each axis - the hand-written extrapolation wouldn't include a contribution from A at all.

We also need to negate the contribution from the grid points on the interior of the grid - the areas that split the grid cell evenly before are now overlapping, and the result we need is no longer any weighted average of the grid values. But, imagining that we were in 3D, we might still need to do some grid-cell-based weighted-averaging, because we could be extrapolating on two axes (as in the 2D example) while interpolating on a third.

Once we've zeroed-out the contribution from the inner point A and negated the contributions from points B and D, we have another problem.

All three non-zeroed points B, C, and D share an overlapping region which has an area that scales not linearly but quadratically with the position of the observation point. This can't be part of the final interpolated result, but it's easy enough to find the area of the nonlinear region and remove it. This removal process is, conveniently, still an O(1) process for each vertex of the grid cell, so while it makes extrapolation slightly slower, it doesn't change the overall perf scaling.

(Figure: 2d geometric method extrapolated in corner region, with corrections)

Extending the geometric method to higher dimensions

Well, that's settled, right? It's not, but you get to find out the way I found out - after slogging through a 3D implementation.

Let's see how it looks in 3D, interpolating to points on the interior of the grid.

To start, we'll look at a point that's on one face of the grid cell, so that we can see what's going on without a forest of lines getting in the way, and leaving two of the sub-cells un-shaded to unclutter the diagram.

(Figure: 3d geometric method, a simple case on the interior)

So far, so good. With a bit of staring and thinking, it's clear that this produces the same behavior as the 2D version as the observation point moves around the face of the cube.

As it moves to the interior of the cube, things get a little more detailed.

(Figure: 3d geometric method, a simple case on the interior)

Here I'm only shading two sub-cells again to keep the clutter down, but it's clear to see that the partitioning still fills the space, and the volume of each sub-cell scales linearly with the observation point's position on each axis.

Anyway, it's all working well so far! Let's see how the fixes from the 2D method hold up under extrapolation in a side region.

(Figure: 3d geometric method, extrapolating in a side region)

Looks good! Only two of the regions are shaded again, but from this, we can already tell that negating the contribution from the points on the interior of the grid will have the intended effect, same as in 2D.

Next, a look at the corner region, where we can exercise the fixes that zero-out the contribution from the entirely-inside vertex and trim the nonlinear region.

A problem in higher dimensions

(Figure: 3d geometric method, extrapolating in a corner region)

Uh-oh. Only shading the nonlinear region(s) this time, because that's the catch: there isn't just one kind of nonlinear region, and the fix based on the 2D method only captures the cubic region, not the quadratic regions.

A solution in higher dimensions

It turns out, we can handle this, too - but not by enumerating all of the kinds of nonlinear region in higher dimensions. Or at least, I don't know how to do that. Instead, we can observe that all of the linear regions are of the same form, the rectangular prisms projecting from each face of the cube. We can also make a real claim that this will be the case in all higher dimensions, because any region that grows in a direction not aligned with the coordinate axes will have a nonlinear volume scaling, so all the linear regions must be aligned with the coordinate axes.

So, rather than removing all the bad regions, which only works well in 2D, we can cycle through all the axes and take all the good regions that are projected from the n-cube's faces. This works, including in higher dimensions, and produces a true linear extrapolation outside the grid. However, it does cost in performance: for the corner regions where this culling process is required, the cost of evaluating an observation point scales like O(2^ndim + ndim^2) instead of just O(2^ndim).

For practical purposes, this is not too bad, but it is noticeable - about a factor of 2-4 reduction in throughput when extrapolating in corner regions, which also breaks timing guarantees (not ideal for, say, embedded control systems).

With that, we have a method that generalizes to arbitrary dimensions, or at least to the 10 dimensions I've tested it in, under both interpolation and extrapolation in every type of region. And it can do this with no allocation (except, possibly, for somewhere to put the output) and with stack usage that scales linearly in the number of dimensions.

Next Steps

I am happy to report that due to the efforts of the cargo-semver-checks, release-plz, maturin, and PyO3 developers, I can gloss over the relatively painless process of releasing the rust library on crates.io, generating python bindings, and releasing the python bindings on PyPI.

Instead of agonizing over tooling and processes, I can focus on extending the existing functionality. Next on the list, in whatever order I find inspiration, are:

  • Proper multi-cubic (Hermite spline) interpolation
  • Vectorized alternatives to the multilinear methods here (for the high-throughput and timing-sensitive regimes)
  • Triangular and tetrahedral mesh interpolation
  • Perf optimizations of the python bindings for the existing multilinear interpolators

2023-10-21: Array Expressions without Allocation


Overview

  • Github: https://github.com/jlogan03/strobe
  • Docs: https://docs.rs/crate/strobe/latest
  • Benchmarks: https://github.com/jlogan03/strobe/pull/6
  • Tooling
    • Checking assembly & vectorization recipes: Compiler Explorer & cargo-asm
    • Benchmarking & perf analysis: criterion & valgrind
    • Linting: rustfmt & clippy
    • Release & versioning: release-plz & cargo-semver-checks

Strobe provides fast, low-memory, elementwise array expressions on the stack. It is compatible with no-std (and no-alloc) environments, but can allocate for outputs as a convenience. Its only required dependencies are num-traits, libm, and rust core.

In essence, this allows a similar pattern to operating over chunks of input arrays, but extends this to expressions of arbitrary depth, and provides useful guarantees to the compiler while eliminating the need for allocation of intermediate results and enabling consistent vectorization.
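
To make that concrete, here is a sketch of the underlying idea in plain Rust (not Strobe's actual API) for the expression (a * b) + c: intermediate results live in a fixed-size buffer on the stack, and the caller allocates only the output.

const CHUNK: usize = 64;

// Evaluate (a * b) + c chunk-by-chunk with stack-resident intermediate storage.
fn mul_add_chunked(a: &[f64], b: &[f64], c: &[f64], out: &mut [f64]) {
    assert!(a.len() == out.len() && b.len() == out.len() && c.len() == out.len());
    let mut tmp = [0.0_f64; CHUNK]; // Fixed intermediate storage on the stack
    for (i, chunk) in out.chunks_mut(CHUNK).enumerate() {
        let start = i * CHUNK;
        let n = chunk.len();
        // Evaluate the intermediate (a * b) into the stack buffer
        for j in 0..n {
            tmp[j] = a[start + j] * b[start + j];
        }
        // Apply the final operation directly into the output chunk
        for j in 0..n {
            chunk[j] = tmp[j] + c[start + j];
        }
    }
}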

Development Process

This crate started as a thought-experiment using Rust's benchmarking utilities on nightly to explore how much of a typical array operation's runtime was related to allocation and how much was the actual numerical operations. As it turned out, under sporadic compute loads (meaning, with a cold heap) more than half of the run time of a vectorized array multiplication was spent just allocating storage.

That observation won't be evident in the benchmarks below, because criterion does a number of cycles before the start of measurement in order to warm up the system and prevent allocation noise from being observed. This gives better quality data, but it's also important to keep in mind that under completely normal conditions, the pre-allocated and non-pre-allocated versions of these calculations give completely different performance results.

I take this observation as evidence of a need to eliminate allocation from the evaluation of array expressions to the maximum possible extent - possibly leaving a single allocation at the end of the expression for the outputs.

With this in mind, the bucket of parts that would become strobe moved to a single file in Compiler Explorer and experienced a number of iterations through making changes then doing ctrl-F for mulpd in the assembly to check whether the resulting calcs were vectorizing properly.

The next key observation was that even with pre-allocated storage available on the heap, it was faster to copy segments of data into mutable storage on the stack than to send data back and forth from heap-allocated storage, even though this sometimes results in more copy events in the code as it is written.

At this point, there's (1) a lot to gain by eliminating allocation for intermediate results and (2) not much to lose by copying data around in intermediate storage on the stack. This gives a natural form for the expression data structure and evaluation:

  • Structure: Each expression node carries a fixed-size segment of storage just large enough to take advantage of vector operations, with controlled memory alignment to make sure we don't do any unaligned read/write.
  • Evaluation: Each expression node populates its fixed storage by operating on data from the fixed storage of its child nodes.

Dead simple, no allocation, and no clever tricks to confuse the user. Each expression node does not need any information other than what its inputs are and what operation to perform - no multidirectional mutable linkages handled by a third orchestrator, no graph data structures, and exactly one lifetime bound in play.

The evaluation takes place by pumping inputs and intermediate values from the leaves (inputs) through each operator node to the root.

That's it. It's just a tree-structured expression, with no unnecessary fanciness. There's nothing in there that could be removed and leave it still working.

Benchmarks

Criterion

For large source arrays in nontrivial expressions, it is about 2-3x faster than the usual method for ergonomic array operations (allocating storage for each intermediate result).

In fact, because it provides guarantees about the size and memory alignment of the chunks of data over which it operates, it is even faster than performing array operations with the full intermediate arrays pre-allocated for most nontrivial applications (at least 3 total array operations).

This is demonstrated in a set of benchmarks run using Criterion, which makes every effort to warm up the system in order to reduce the overhead associated with heap allocations. Three categories of benchmarks are run, each evaluating the same mathematical operations over the same data:

  1. Slice-style vectorization, allocating for each intermediate result & the output
  2. Slice-style vectorization, evaluating intermediate results into pre-allocated storage, then allocating once for the output
  3. Strobe, using fixed-size intermediate storage on the stack, then allocating once for the output

While the allocation for the output array might give an unnecessarily pessimistic view of the maximum throughput of the pre-allocated and Strobe versions of the calc, I believe this gives a comparison that is more relevant to common use cases.

By the time we reach 4 multiplication operations in an expression, strobe sees about double the throughput of the other methods for array sizes larger than about 1e5 elements, and is only slightly slower than the fastest alternative for smaller sizes. The benefit for large array operations is large, and the penalty for small ones is small.

(Figure: 4x mul in an expression)

The worst case, where Strobe is used gratuitously for a single operation with a small array, is about 3x worse than just doing a contiguous slice operation:

(Figure: gratuitous use for a single mul)

The above results are obtained with a 2018 Macbook Pro (Intel x86). Similar scalings are obtained with a Ryzen 5 3600, with slight differences in the crossover point between methods, likely due to differences in cache size and memory performance. Results from both systems used for benchmarking and for more calculations are available here.

Cachegrind

Unit tests use arrays of size 67, 3 more than the intermediate storage size of 64, in order to make sure that we visit the behavior of the system when it sees storage that is neither full nor empty in a given cycle. The first 64 values will be consistently vectorized and cached properly, as will the first two in the group of 3 values, so we can expect around 1/67th (1.5%) of the evaluations to miss the cache due to swapping from the vector loop to the scalar loop. Cachegrind run over the unit tests supports this napkin estimate:

jlogan@jlogan-MS-7C56:~/git/strobe$ valgrind --tool=cachegrind ./target/release/deps/strobe-5cb1bc954f8de3f1
==24414== Cachegrind, a cache and branch-prediction profiler
==24414== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==24414== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==24414== Command: ./target/release/deps/strobe-5cb1bc954f8de3f1
==24414==
--24414-- warning: L3 cache found, using its data for the LL simulation.

running 33 tests
test test::test_atanh ... ok
...(truncated)...
test test_std::test_std ... ok

test result: ok. 33 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.16s

==24414==
==24414== I   refs:      1,992,045
==24414== I1  misses:       22,729
==24414== LLi misses:        4,408
==24414== I1  miss rate:      1.14%
==24414== LLi miss rate:      0.22%
==24414==
==24414== D   refs:        736,224  (451,692 rd   + 284,532 wr)
==24414== D1  misses:       15,937  (  8,971 rd   +   6,966 wr)
==24414== LLd misses:        7,806  (  3,756 rd   +   4,050 wr)
==24414== D1  miss rate:       2.2% (    2.0%     +     2.4%  )
==24414== LLd miss rate:       1.1% (    0.8%     +     1.4%  )
==24414==
==24414== LL refs:          38,666  ( 31,700 rd   +   6,966 wr)
==24414== LL misses:        12,214  (  8,164 rd   +   4,050 wr)
==24414== LL miss rate:        0.4% (    0.3%     +     1.4%  )