Positron
Positron.ai develops custom hardware systems that accelerate AI inference, delivering significant gains over traditional GPU-based systems in both performance per dollar and performance per watt. Positron exists to create the world's best AI inference systems.
Senior Software Engineer: Machine Learning Systems & High-Performance LLM Inference
We are seeking a Senior Software Engineer to contribute to the development of high-performance software that powers execution of open-source large language models (LLMs) on our custom appliance. This appliance leverages a combination of FPGAs and x86 CPUs to accelerate transformer-based models. The software stack is written primarily in modern C++ (C++17/20) and relies heavily on templates, SIMD optimizations, and efficient parallel computing techniques.

Key Areas of Focus & Responsibilities
- Design and implement high-performance inference software for LLMs on custom hardware.
- Develop and optimize C++-based libraries that efficiently utilize SIMD instructions, threading, and the memory hierarchy.
- Work closely with FPGA and systems engineers to ensure efficient data movement and computational offloading between x86 CPUs and FPGAs.
- Optimize model execution via low-level optimizations, including vectorization, cache efficiency, and hardware-aware scheduling.
- Contribute to performance profiling tools and methodologies to analyze execution bottlenecks at the instruction and data-flow levels.
- Apply NUMA-aware memory management techniques to optimize memory access patterns for large-scale inference workloads.
- Implement ML system-level optimizations such as token streaming, KV cache optimizations, and efficient batching for transformer execution.
- Collaborate with ML researchers and software engineers to integrate model quantization techniques, sparsity optimizations, and mixed-precision execution.
- Ensure all code contributions include unit, performance, acceptance, and regression tests as part of a continuous integration-based development process.
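To give candidates a concrete sense of the KV cache work mentioned above, here is a minimal sketch of per-sequence key/value caching for one attention head. The `KVCache` struct and its row-major layout are hypothetical illustrations for this posting, not the appliance's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal per-sequence KV cache for one attention head (hypothetical layout).
// Keys/values of already-generated tokens are appended once and reused, so
// decode step t reads O(t) cached rows instead of recomputing all projections.
struct KVCache {
    std::size_t head_dim;
    std::vector<float> keys;    // [tokens x head_dim], row-major
    std::vector<float> values;  // [tokens x head_dim], row-major

    explicit KVCache(std::size_t dim) : head_dim(dim) {}

    std::size_t tokens() const { return keys.size() / head_dim; }

    // Append the K/V projections of one newly generated token.
    void append(const std::vector<float>& k, const std::vector<float>& v) {
        assert(k.size() == head_dim && v.size() == head_dim);
        keys.insert(keys.end(), k.begin(), k.end());
        values.insert(values.end(), v.begin(), v.end());
    }
};
```

In production code the contiguous `std::vector` storage would typically be replaced by preallocated, NUMA-placed buffers to avoid reallocation during decoding.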
Required Skills & Experience

- 7+ years of professional experience in C++ software development, with a focus on performance-critical applications.
- Strong understanding of C++ templates and modern memory management.
- Hands-on experience with SIMD programming (AVX-512, SSE, or equivalent) and intrinsics-based vectorization.
- Experience in high-performance computing (HPC), numerical computing, or ML inference optimization.
- Experience with ML model execution optimizations, including efficient tensor computations and memory access patterns.
- Knowledge of multi-threading, NUMA architectures, and low-level CPU optimization.
- Proficiency with systems-level software development, profiling tools (Perfetto, VTune, Valgrind), and benchmarking.
- Experience working with hardware accelerators (FPGAs, GPUs, or custom ASICs) and designing efficient software-hardware interfaces.
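As an illustration of the intrinsics-based vectorization listed above, here is a small dot-product sketch with an AVX2 fast path and a scalar fallback. This is a generic example under the assumption of an x86 target, not code from the Positron stack.

```cpp
#include <cassert>
#include <cstddef>
#ifdef __AVX2__
#include <immintrin.h>
#endif

// Dot product with an intrinsics fast path and a portable scalar fallback.
// The AVX2 path processes 8 floats per iteration; any remainder (or the whole
// array when AVX2 is unavailable) is handled by the scalar loop.
float dot(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    std::size_t i = 0;
#ifdef __AVX2__
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                               _mm256_loadu_ps(b + i)));
    }
    alignas(32) float lanes[8];
    _mm256_store_ps(lanes, acc);          // horizontal reduction of the 8 lanes
    for (float l : lanes) sum += l;
#endif
    for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail
    return sum;
}
```

A production kernel would additionally use FMA (`_mm256_fmadd_ps`), multiple accumulators to hide latency, and alignment-aware loads.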
Preferred Skills (Nice to Have)

- Familiarity with LLVM/Clang or GCC compiler optimizations.
- Experience in LLM quantization, sparsity optimizations, and mixed-precision computation.
- Knowledge of distributed inference techniques and networking optimizations.
- Understanding of graph partitioning and execution scheduling for large-scale ML models.

Why Join Us?

- Work on a cutting-edge ML inference platform that redefines performance and efficiency for LLMs.
- Tackle challenging low-level performance engineering problems in AI and HPC.
- Collaborate with a team of hardware, software, and ML experts building an industry-first product.
- Opportunity to contribute to and shape the future of open-source AI inference software.