JAX Pallas is an experimental Python module for writing high-performance GPU kernels, including ones aimed at NVIDIA's new Blackwell chips. Using familiar Python and jax.numpy-style code, you can build custom matrix-multiplication kernels that beat general-purpose libraries. The official tutorial walks through a small, hand-tuned kernel that approaches the GPU's theoretical peak, which can translate into real time and energy savings on large AI training jobs. Although Pallas is still experimental, its kernels plug straight into existing JAX projects, making it easier to get the best performance out of the latest GPUs using just Python.
What is JAX Pallas and how does it help optimize GPU performance on NVIDIA Blackwell?
JAX Pallas is an experimental module that lets developers write low-level GPU code in familiar Python using jax.numpy syntax. By compiling through Triton, it enables hand-tuned kernels that outperform generic BLAS libraries and exploit Blackwell GPU features for significant speed and efficiency gains.
Why a Python Notebook Suddenly Matters to GPU Architects
JAX just published a hands-on tutorial that walks through building a hand-tuned matrix-multiplication kernel for NVIDIA’s upcoming Blackwell GPUs. The guide relies on the experimental Pallas module, which lets developers script low-level GPU code in familiar jax.numpy style and then have it compiled through Triton into efficient PTX.
Blackwell in One Paragraph
The B200 flagship carries a dual-die design with 208 billion transistors and feeds each die with four HBM3e stacks. Combined bandwidth tops out at 34 TB per second, while FP8 compute is rated at 20 PFLOPS according to NVIDIA’s architecture brief captured by the Cudo Compute analysis here. Those figures reshape the ceiling for large model training but also expose bottlenecks in generic BLAS libraries, especially when developers target newer mixed-precision formats like FP4 and FP6.
What the Tutorial Covers
- Thread-block tiling – choosing tile sizes that match Tensor Core warp scheduling on Blackwell’s fifth-generation Tensor Cores.
- Software-pipelined memory movement – overlapping HBM reads with compute so each 192-KB shared memory segment stays saturated.
- Ephemeral layouts – writing accumulators in FP32 but down-converting to FP8 only when storing back to HBM, minimizing precision loss.
- Autotuned launch parameters – letting JAX trace alternate block shapes and pick the variant that maximizes occupancy.
Code snippets stay tight. The highlight is an eight-line kernel that attains 91 percent of the GPU’s theoretical FP8 peak in the benchmark section. For context, that single kernel reaches roughly 18 PFLOPS on a B200 at stock clocks, eclipsing the vendor-supplied cuBLAS routine in the same configuration. Full benchmark tables are included in the notebook and replicate consistently on early-access Blackwell instances hosted by CoreWeave.
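For readers who have not used Pallas before, the sketch below shows the overall shape of such a kernel. It is not the notebook's eight-line kernel: the block sizes, the `tiled_matmul` name, the bfloat16 output default, and the use of `preferred_element_type` are illustrative assumptions, and the `BlockSpec` argument order has shifted between recent JAX releases, so check it against your pinned version.

```python
import functools

import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl


def matmul_kernel(x_ref, y_ref, o_ref):
    # Each program instance owns one (bm, bn) output tile. Accumulate in
    # float32 and down-convert only when writing the tile back out.
    acc = jnp.dot(x_ref[...], y_ref[...], preferred_element_type=jnp.float32)
    o_ref[...] = acc.astype(o_ref.dtype)


@functools.partial(jax.jit, static_argnames=("bm", "bn", "out_dtype"))
def tiled_matmul(x, y, *, bm=128, bn=128, out_dtype=jnp.bfloat16):
    m, k = x.shape
    _, n = y.shape
    return pl.pallas_call(
        matmul_kernel,
        out_shape=jax.ShapeDtypeStruct((m, n), out_dtype),
        grid=(m // bm, n // bn),                         # one program per output tile
        in_specs=[
            pl.BlockSpec((bm, k), lambda i, j: (i, 0)),  # row panel of x
            pl.BlockSpec((k, bn), lambda i, j: (0, j)),  # column panel of y
        ],
        out_specs=pl.BlockSpec((bm, bn), lambda i, j: (i, j)),
    )(x, y)
```

Swapping `out_dtype` for `jnp.float8_e4m3fn` on hardware and toolchains that support it reproduces the FP32-accumulate, FP8-store pattern described in the list above.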
Why Hand-Tuning Still Pays Off
GEMM dominates transformer workloads, so every percentage point shaved off each multiply-accumulate translates almost directly into shorter training time. Hand-tuned kernels have shown 5-20 percent wall-clock savings in production training jobs, a range echoed by Oak Ridge’s GEMM tuning report PDF. On a cluster of 256 B200 GPUs, a 10 percent gain translates to hundreds of GPU-hours per day reclaimed for new experiments.
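The back-of-envelope arithmetic behind that figure is simple; the numbers below just restate the ones in the paragraph.

```python
gpus = 256            # B200 cards in the cluster
hours_per_day = 24
saving = 0.10         # 10 percent wall-clock reduction from the tuned GEMM

reclaimed = gpus * hours_per_day * saving
print(f"{reclaimed:.0f} GPU-hours reclaimed per day")  # -> 614 GPU-hours
```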
Energy efficiency follows the same curve. NVIDIA claims Blackwell cuts inference wattage up to 25x relative to Hopper when code is optimized for FP4 and FP6. The JAX tutorial demonstrates exactly how to hit those formats without drifting from Python.
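Hitting those formats from Python is mostly a matter of dtype handling. The fragment below is a minimal illustration using JAX's built-in FP8 type; whether FP4 and FP6 dtypes are also exposed depends on the ml_dtypes version bundled with your JAX build, so treat that part as an assumption to verify.

```python
import jax.numpy as jnp

x = jnp.ones((4, 4), dtype=jnp.float32)

# Keep accumulations in float32; down-convert activations or weights to
# FP8 (E4M3) only when storing or communicating them.
x_fp8 = x.astype(jnp.float8_e4m3fn)
print(x_fp8.dtype)  # float8_e4m3fn
```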
Adopting the Workflow
- Install the nightly JAX build with `pip install --pre jax[cuda]`.
- Clone the tutorial notebook from the official JAX-examples repo.
- Run on any Blackwell preview instance or emulate locally with CUDA 12.4 to verify compilation.
- Swap in your model’s shape parameters, rerun the autotuner (sketched below), and export the kernel as a reusable Python function.
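A hand-rolled version of that autotuning step might look like the following. Here `make_matmul` is a stand-in for a factory that builds the Pallas matmul for one block shape (for example, a partial application of a kernel like the earlier sketch); the candidate shapes and the timing loop are illustrative, not the notebook's tuner.

```python
import time


def autotune(make_matmul, x, y, candidates=((64, 64), (128, 128), (128, 256))):
    """Return the (bm, bn) block shape with the lowest measured runtime."""
    best_shape, best_time = None, float("inf")
    for bm, bn in candidates:
        f = make_matmul(bm=bm, bn=bn)
        f(x, y).block_until_ready()        # compile and warm up
        start = time.perf_counter()
        f(x, y).block_until_ready()        # timed run
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_shape, best_time = (bm, bn), elapsed
    return best_shape
```

In practice you would average several timed runs and sweep more shapes, but the structure stays this simple.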
Developers comfortable with JIT-driven libraries will appreciate that Pallas kernels coexist with higher-level jax.lax calls. You can integrate the new GEMM into an existing training loop by swapping a single function reference, as shown below. The notebook even includes a microbenchmark that drops the tuned kernel into a GPT attention block and plots end-to-end speedups.
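The function swap itself is as small as it sounds. In the sketch below, `tuned_matmul` is a placeholder for the exported Pallas kernel (assumed to accept the same operands as jnp.matmul), and the attention block is a stripped-down stand-in rather than the notebook's GPT block.

```python
import jax
import jax.numpy as jnp


def attention(q, k, v, matmul=jnp.matmul):
    # Scaled dot-product attention with the GEMM routine injected as an argument.
    scores = matmul(q, k.swapaxes(-1, -2)) / jnp.sqrt(q.shape[-1])
    weights = jax.nn.softmax(scores, axis=-1)
    return matmul(weights, v)


q = k = v = jnp.ones((8, 128, 64), dtype=jnp.bfloat16)
baseline = attention(q, k, v)                      # stock jnp.matmul
# tuned = attention(q, k, v, matmul=tuned_matmul)  # drop-in Pallas kernel
```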
Looking Ahead
Pallas remains experimental according to its documentation here, yet the Blackwell tutorial signals a clear direction: high-level Python ergonomics paired with hardware-aware control. As chip complexity rises, such blueprints will likely become a requirement rather than an optimization luxury.
What is Pallas and why does it matter for GPU programming in JAX?
Pallas is a JAX extension that lets you write custom GPU and TPU kernels while still using the familiar jax.numpy API. Instead of dropping into raw CUDA, you express your kernel logic in Python; Pallas then translates it to Triton for NVIDIA GPUs or Mosaic for TPUs. This gives you fine-grained control over tiling, memory movement and register usage without abandoning JAX’s functional, just-in-time compilation model. The result is a single source file that can be auto-tuned for Blackwell’s 34 TB/s HBM3e bandwidth or for any other accelerator, cutting weeks of low-level work down to hours.
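A minimal, self-contained example of that workflow is shown below. It is not taken from the tutorial; the kernel and wrapper names are illustrative.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl


def add_kernel(x_ref, y_ref, o_ref):
    # Refs behave like arrays inside the kernel; plain jax.numpy ops apply.
    o_ref[...] = x_ref[...] + y_ref[...]


@jax.jit
def add(x, y):
    # With no grid or BlockSpecs, a single kernel instance sees the full arrays.
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)


x = jnp.arange(8, dtype=jnp.float32)
print(add(x, x))  # [ 0.  2.  4.  6.  8. 10. 12. 14.]
```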
How much faster can hand-tuned GEMM kernels make real training & inference jobs?
Published studies show 5-20 % end-to-end speed-ups on large language and vision models when GEMM kernels are matched to the exact matrix shapes and hardware. NVIDIA’s latest heuristics-driven tuner reaches 99 % of exhaustive-search performance while reducing tuning time by 5×, and Stanford’s AI-generated CUDA kernels have hit 104 % of the vendor library baseline. For a trillion-parameter model that means days of training time saved and, on Blackwell, up to 25× lower energy per inference token versus Hopper.
Which Blackwell hardware features does the JAX tutorial specifically target?
The walkthrough exploits four Blackwell innovations:
- 208 B-transistor dual-die design – the tutorial teaches you to keep both dies busy with double-sized tiles.
- 1 TB/s per HBM3e stack – kernels are staged so that each 128×128 tile is loaded exactly once into shared memory, hiding HBM latency behind compute.
- FP6/FP4 Tensor Cores – Pallas code templates show how to switch precision on the fly, delivering 40 PFLOPS at FP4 for inference-heavy graphs.
- 50 GB/s 5th-gen NVLink – the guide includes an all-reduce micro-kernel that sustains 7.2 TB/s collective bandwidth across 576 GPUs, critical for 740 B-parameter model shards.
Is Pallas production-ready or still an experiment?
Pallas remains experimental in 2025; the API changes monthly and some JAX primitives (e.g. custom VJPs) cannot yet be called inside a kernel. However, the repository is actively maintained and the Blackwell tutorial ships with version-pinned containers (JAX 0.4.36 + CUDA 12.6) that have been validated on CoreWeave GB200 instances. Teams are encouraged to pin those images for production while keeping an eye on the changelog before each upgrade.
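A lightweight guard against silent drift, assuming the tutorial's pinned JAX version, is to check the environment at import time; the check itself is a suggestion, not part of the shipped containers.

```python
import jax

# Version taken from the tutorial's pinned container (JAX 0.4.36 + CUDA 12.6);
# update it only after re-validating your Pallas kernels.
EXPECTED_JAX = "0.4.36"

if jax.__version__ != EXPECTED_JAX:
    raise RuntimeError(
        f"Pallas kernels validated against JAX {EXPECTED_JAX}, "
        f"but this environment has {jax.__version__}."
    )
```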
Can the same Pallas kernel run on both GPU and TPU?
Not without modification. GPU kernels compile through Triton, whereas TPU kernels compile through Mosaic; memory hierarchies, tile sizes and even data types differ. The tutorial therefore provides separate starter templates:
- `matmul_blackwell_gpu.py` – 256-thread blocks, shared-memory swizzling, FP6/FP4 support.
- `matmul_tpu_v5e.py` – 128×128 MXU tiles, HBM scalar prefetch, bfloat16 native.
You can share high-level algorithmic code (e.g. blocking loops), but the hardware-specific parameters must be re-tuned for each platform.
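One common way to share the algorithmic layer while keeping the hardware knobs separate is a small per-backend parameter table, as in this sketch (the tile values are placeholders, not the templates' actual settings):

```python
import jax

# Illustrative per-backend blocking parameters; the real templates would carry
# the tuned values for Blackwell and TPU v5e respectively.
TILE_CONFIG = {
    "gpu": dict(bm=128, bn=128, bk=64),
    "tpu": dict(bm=128, bn=128, bk=128),   # aligned with the 128x128 MXU tile
}


def blocking_params():
    backend = jax.default_backend()        # "gpu", "tpu", or "cpu"
    return TILE_CONFIG.get(backend, dict(bm=32, bn=32, bk=32))
```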