Smallest Inc.
GPU Optimization Engineer | SF
Smallest Inc., San Francisco, California, United States, 94199
Role
We’re hiring a GPU Optimization Engineer who understands GPUs at a deep, architectural level — someone who knows exactly how to squeeze every last millisecond out of a model, which GPU constraints matter, and how to restructure models for real-world inference performance. You’ll work across CUDA kernels, model graph optimizations, hardware-specific tuning, and porting models across GPU architectures. Your work directly impacts the latency, throughput, and reliability of Smallest’s real-time speech models.
What You’ll Do
Optimize model architectures (ASR, TTS, SLMs) for maximum performance on specific GPU hardware
Profile models end-to-end to identify GPU bottlenecks — memory bandwidth, kernel launch overhead, fusion opportunities, quantization constraints
Design and implement custom kernels (CUDA/Triton/Tinygrad) for performance-critical model sections
Perform operator fusion, graph optimization, and kernel-level scheduling improvements (see the toy fusion sketch after this list)
Tune models to fit GPU memory limits while maintaining quality
Benchmark and calibrate inference across NVIDIA, AMD, and potentially emerging accelerators
Port models across GPU chipsets (NVIDIA → AMD / edge GPUs / new compute backends)
Work with TensorRT, ONNX Runtime, and custom runtimes for deployment
Partner with the research and infra teams to ensure the entire stack is optimized for real-time workloads
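For a flavor of the kernel and fusion work above, here is a minimal, illustrative Triton sketch — not a description of Smallest’s actual kernels. It fuses a bias add and a ReLU into one kernel so the intermediate tensor never round-trips through global memory. All names (fused_bias_relu_kernel, fused_bias_relu, BLOCK) are hypothetical, and bias is assumed to match x elementwise.

import torch
import triton
import triton.language as tl

@triton.jit
def fused_bias_relu_kernel(x_ptr, bias_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Each program instance covers BLOCK contiguous elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    b = tl.load(bias_ptr + offsets, mask=mask)
    # Fused epilogue: the bias add and ReLU happen in registers, so the
    # intermediate (x + b) is never written back to global memory.
    tl.store(out_ptr + offsets, tl.maximum(x + b, 0.0), mask=mask)

def fused_bias_relu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # Toy assumption: x and bias are CUDA tensors of identical shape.
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_bias_relu_kernel[grid](x, bias, out, n, BLOCK=1024)
    return out

Compared with running the two ops separately, this removes one kernel launch and one full read-write pass over the activation tensor — the kind of memory-bandwidth win this role targets.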
Requirements
Strong understanding of GPU architecture — SMs, warps, memory hierarchy, occupancy tuning
Hands-on experience with CUDA, kernel writing, and kernel-level debugging
Experience with kernel fusion and model graph optimizations
Familiarity with TensorRT, ONNX, Triton, tinygrad, or similar inference engines
Strong proficiency in PyTorch and Python
Deep understanding of model architectures (transformers, convs, RNNs, attention, diffusion blocks)
Experience profiling GPU workloads using Nsight, nvprof, or similar tools (a minimal profiling sketch follows this list)
Strong problem‑solving abilities with a performance‑first mindset
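As a small, hedged illustration of the profiling workflow listed above, the sketch below uses torch.profiler to rank operators by accumulated GPU time; the Linear stand-in model and all variable names are placeholders for a real speech model under test.

import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in model and input; a real run would profile the actual ASR/TTS model.
model = torch.nn.Linear(1024, 1024).cuda().eval()
inputs = torch.randn(64, 1024, device="cuda")

with torch.no_grad(), profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    for _ in range(10):
        model(inputs)

# Rank ops by accumulated CUDA time to surface kernel-level hotspots; the same
# trace can be exported with prof.export_chrome_trace("trace.json") and read
# alongside an Nsight Systems capture for timeline-level inspection.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))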
Great to Have
Experience with quantization (INT8, FP8, hybrid formats)
Experience with audio/speech models (ASR, TTS, SSL, vocoders)
Contributions to open‑source GPU stacks or inference runtimes
Published work related to systems‑level model optimization
Who Will Succeed in This Role
Someone who:
thinks in kernels, not just layers
knows which optimizations are theoretical vs practically impactful
understands GPU boundaries (memory, bandwidth, latency) and how to work around them
is excited by the challenge of ultra‑low latency and large‑scale real‑time inference
loves debugging at the CUDA + model level