GSoC 2026 · NumFOCUS · scikit-bio

GSoC 2026: Bringing Numba-Optimized Performance to scikit-bio

I was accepted into Google Summer of Code 2026 with NumFOCUS/scikit-bio for a 350-hour project replacing performance-critical Cython paths with Numba CPU and GPU implementations.

scikit-bio ↗ NumFOCUS GSoC →
350hLarge project
2.4xPrototype speedup
<1e-15Numerical agreement
May 25Coding starts

Accepted into GSoC 2026

On May 1, 2026, Google notified me that my proposal with NumFOCUS had been accepted for Google Summer of Code 2026. The accepted project title is Numba Optimized Implementations for scikit-bio's Performance-Critical Functions.

The project is under scikit-bio, a BSD-licensed Python library for biological omic data analysis. It sits inside the NumFOCUS ecosystem and is used in workflows around distance matrices, ordination, compositional data analysis, diversity metrics, and biological sequence analysis.

This is not a generic performance project. The work targets the computational core behind PCoA, Mantel tests, PERMANOVA, and related bioinformatics workflows where large biological datasets make performance and memory behavior matter.

The Problem: Fast CPU Code, GPU-Blind Internals

scikit-bio currently has two important execution paths for numerical code:

PathStrengthLimitation
CythonFast on CPU with single-pass memory patternsCannot directly operate on GPU-backed arrays because typed memoryviews require host memory
Array API / xpWorks across NumPy, CuPy, JAX, PyTorch, and Dask-style backendsOften slower on CPU because vectorized operations make multiple passes through the matrix

That creates a divergence problem: the fastest CPU code paths are locked behind Cython, while the GPU-capable paths live in separate pure Array API implementations. Over time, two parallel implementations become harder to maintain and easier to subtly diverge.

The Proposed Solution: One Source, Three Execution Paths

The accepted proposal replaces selected Cython functions with Numba implementations. Numba lets us write Python code that compiles to optimized native CPU code via LLVM, while also giving an explicit CUDA kernel path for GPU execution.

1. Numba CPU

@njit(parallel=True, cache=True) for NumPy arrays on CPU. This is the default fast path and should match or exceed the existing Cython implementation.

2. Numba CUDA

@cuda.jit for memory-bandwidth-bound operations where a fused kernel beats multiple Array API kernels on GPU-backed arrays.

3. Array API fallback

The xp path remains the safety net for backends like JAX, PyTorch, Dask, or environments where Numba is not installed.

Numba will be an optional dependency. If it is unavailable, users still get correct results through the existing Array API-compatible fallback path.

The Prototype That Made the Proposal Concrete

The core prototype was for center_distance_matrix, which is used in Principal Coordinates Analysis (PCoA). The algorithm performs Gower centering: square the distance matrix, accumulate row means, compute a global mean, and apply centering.

center_distance_matrix Numba shape
from numba import njit, prange
import numpy as np

@njit(parallel=True)
def center_distance_matrix_numba(D):
    n = D.shape[0]
    E = np.empty((n, n), dtype=D.dtype)
    row_means = np.empty(n, dtype=D.dtype)
    global_sum = 0.0

    for i in prange(n):
        row_sum = 0.0
        for j in range(n):
            val = -0.5 * D[i, j] * D[i, j]
            E[i, j] = val
            row_sum += val
        row_means[i] = row_sum / n
        global_sum += row_sum

    global_mean = global_sum / (n * n)
    for i in prange(n):
        for j in range(n):
            E[i, j] = E[i, j] - row_means[i] - row_means[j] + global_mean
    return E

The important part is not the decorator alone. It is the memory-access pattern: one pass for squaring and row accumulation, then one pass for centering. That mirrors the Cython advantage while keeping the implementation in Python.

Matrix sizeCythonNumba @njit(parallel=True)
n = 5000.39 ms0.34 ms
n = 10001.75 ms1.01 ms
n = 300019.3 ms7.91 ms

At n=3000, the Numba prototype was 2.4x faster than Cython in the same benchmark session, with numerical agreement below 1e-15.

What I Will Build

The accepted scope focuses on the distance and ordination modules: the places where performance-critical matrix operations appear repeatedly.

Tier 1

Ordination Cython replacements

Replace e_matrix_means_cy, f_matrix_inplace_cy, and center_distance_matrix_cy. These underpin PCoA and are the direct continuation of the working prototype.

Tier 2

Distance module Cython replacements

Implement Numba versions of Mantel permutation correlation, PERMANOVA pseudo-F statistics, and geometric median routines where the loop and reduction patterns fit Numba.

Tier 3

Distance utilities

Port validation and matrix reordering utilities such as is_symmetric_and_hollow and distmat_reorder where straightforward Numba implementations can remove more Cython dependency surface.

Tier 4

Ordination utilities

Add Array API and Numba-aware paths for utilities like mean_and_std, scale, corr, e_matrix, and f_matrix.

Not every Cython file is a good Numba target. Alignment traceback, tree traversal, phylogenetic dictionary lookups, and treap-like metadata structures are explicitly out of scope because they do not match Numba's strengths.

Mentorship

Igor Sfiligoi

Primary mentor from UC San Diego / San Diego Supercomputer Center. His work focuses on HPC, GPU acceleration, cache-aware restructuring, and bioinformatics performance.

Qiyun Zhu

Co-mentor from Arizona State University and a key architect of scikit-bio's Array API migration, including the ingest_array / xp pattern.

Matthew Aton

Co-mentor from Arizona State University working on Array API infrastructure such as decorators for backend-compatible functions and tests.

The technical bar is clear: benchmarks must be honest, numerical claims must be verified, and performance work must respect memory access patterns rather than chasing decorator-based speedups blindly.

Timeline

1

Community Bonding: May 1 - May 24

Set up the development environment, confirm scope, audit Cython files, reproduce prototype benchmarks, study mentor papers, and prepare a warm-up PR.

2

Weeks 1-6: May 25 - July 5

Build Numba integration infrastructure, replace ordination Cython functions, implement Mantel and PERMANOVA Numba paths, and prepare for midterm evaluation.

3

Midterm: July 6 - July 10

Target state: Tier 1 complete and merged, with Mantel and PERMANOVA implementations ready for review.

4

Weeks 7-12: July 11 - August 17

Finish utilities, GPU kernels, benchmark tables, documentation, review cleanup, final report, and final blog post.

End of May: First Week of Coding

The official coding period began May 25, 2026. This update covers what shipped, what is in active review, and the engineering lessons that came out of the first week of work.

Merged Pull Requests

scikit-bio #2464

Numba CPU permuted Pearson helpers for Mantel

Direct Cython-to-Numba conversion of mantel_perm_pearsonr_cy and its condensed-form variant. The Numba implementation preserves the same loop structure and parallelism axis as the original Cython. A NUMBA_AVAILABLE dispatch keeps Numba an optional dependency: if it is not installed, the existing Cython path stays in service unchanged.

scikit-bio #2468

Numba-aware Mantel tests and CI workflow

Test infrastructure for the Numba conversion work: paired _cy and _numba tests against the same fixtures, dispatch fallback tests via patch.object(NUMBA_AVAILABLE, False), and a numba_code pytest marker. The CI workflow runs the marker across four operating systems (Ubuntu x86_64, Ubuntu ARM64, macOS, Windows) on Python 3.14.

Pull Request in Active Review

scikit-bio #2470

Numba CPU s_W helpers for PERMANOVA

Direct conversion of permanova_f_stat_sW_cy and the condensed-form variant. CI is green across the full Python matrix on four operating systems. Review feedback has been addressed through follow-up commits including a partial-sum simplification, a test class restructure that mirrors the Mantel test architecture, and replacement of an inline Python reference computation with a hardcoded value from the trusted Cython implementation.

Numba 0.65.1 parfor cycle bug

The natural Numba code for PERMANOVA's s_W crashes at compile time with AssertionError: unexpected cycle in lookup() from inside the parfor reduction-analysis pass. The crash needs two features together: a cross-iteration scalar reduction in the outer prange, and an if conditional inside the inner loop that reads from a parameter array. Either feature alone compiles fine.

The minimal trigger
@njit(parallel=True)
def natural_sW(mat, group_sizes, grouping):
    n = mat.shape[0]
    s_W = 0.0
    for row in prange(n - 1):
        group = grouping[row]
        local = 0.0
        for col in range(row + 1, n):
            if grouping[col] == group:        # conditional read of a parameter
                val = mat[row, col]
                local += val * val
        s_W += local / group_sizes[group]      # cross-iteration scalar reduction
    return s_W

The workaround replaces the scalar reduction with per-iteration writes to a partials array, summed once after the prange exits. The conditional stays in place. Beyond sidestepping the bug, this pattern has a useful property for the future GPU work: parallel=True reductions are a CPU-only Numba feature, but the per-iteration-slot pattern maps directly to @cuda.jit.

Engineering lesson: tests should validate against trusted references

One review point that came up on PR #2470 was about test design. The initial test class included a Python reimplementation of the s_W computation as the "expected value" for the paired _cy and _numba tests. That carries a subtle bug-masking risk: if the test reimplementation and the function under test share the same misunderstanding of the formula, both pass together while being wrong together.

The fix looks obvious in hindsight. Compute the expected value once using the trusted reference implementation (here, the pre-existing Cython), hardcode the resulting number, and assert against it. The test no longer carries an algorithmic implementation. It carries a single value from a trusted source.

Peer collaboration on PCoA

Another GSoC contributor on the same scikit-bio project is working on Numba CPU and GPU implementations of center_distance_matrix for PCoA. The first week of collaboration covered how to structure the work as separate CPU and GPU PRs, how to slot the new tests into the existing Numba CI workflow, and how the Mantel and PERMANOVA architecture can serve as a template for that branch.

What's queued next

Three pieces of work are queued behind PR #2470:

  1. Interface cleanup PR. Make Numba opt-in rather than auto-selected on installation, and rename the _numba function suffix to _nb to match the existing _cy convention. This will target Mantel and PERMANOVA at the same time.
  2. Block-permutation optimization for PERMANOVA. Process multiple permutations per matrix scan, with cache tiling and a parallelization axis flipped from matrix rows to permutation blocks. The C++ reference for this is in the scikit-bio-binaries repository.
  3. GPU @cuda.jit implementation. Builds on the optimized CPU code. The partials pattern from the parfor workaround already aligns with the structure needed for this transition.

Why This Matters

scikit-bio is not an isolated library. It supports downstream scientific workflows in ecosystems like QIIME 2, Emperor, and Qiita. When performance-critical matrix operations become faster and more GPU-capable, the benefit can propagate into real biological data analysis pipelines.

The engineering lesson is bigger than one library: modern scientific Python needs graceful heterogeneity. CPU should be fast. GPU should be possible. Fallback paths should be correct. Maintainers should not have to carry multiple drifting implementations forever.

My goal for the summer is simple: make scikit-bio's fastest paths easier to maintain, easier to benchmark, and ready for CPU and GPU execution without compromising numerical correctness.