Baseten

Software Engineer - Model APIs

Baseten, San Francisco, California, United States, 94199


Overview

Baseten powers inference for AI applications across dynamic teams and industries. We enable machines to run at the frontier of AI with integrated research, infrastructure, and developer tooling. This role is part of Baseten’s Model Performance (MP) team, focusing on Model APIs that power hosted endpoints for open‑source models. You will work at the intersection of product, model performance, and infra to shape how developers interact with AI models at scale.

Base pay range

$150,000.00/yr - $230,000.00/yr

Responsibilities

- Design, build, and operate the Model APIs surface with a focus on advanced inference capabilities: structured outputs (JSON mode, grammar‑constrained generation), tool/function calling, and multi‑modal serving
- Profile and optimize TensorRT‑LLM kernels, analyze CUDA kernel performance, implement custom CUDA operators, and tune memory allocation for maximum throughput
- Productionize performance improvements across runtimes with a deep understanding of their internals: speculative decoding, guided generation for structured outputs, and custom scheduling and routing for high‑performance serving
- Build comprehensive benchmarking frameworks to measure real‑world performance across model architectures, batch sizes, sequence lengths, and hardware configurations
- Instrument deep observability (metrics, traces, logs) and build repeatable benchmarks to measure speed, reliability, and quality
- Implement platform fundamentals: API versioning, validation, usage metering, quotas, and authentication
- Collaborate with other teams to deliver robust, developer‑friendly model serving experiences

Requirements

- 3+ years of experience building and operating distributed systems or large‑scale APIs
- Proven track record of owning low‑latency, reliable backend services (rate limiting, auth, quotas, metering, migrations)
- Infra instincts with performance sensibilities: profiling, tracing, capacity planning, and SLO management
- Comfortable debugging complex systems, from runtime internals to GPU execution traces
- Strong written communication; able to produce clear design docs and collaborate across functions

Nice to have

- Experience with LLM runtimes (vLLM, SGLang, TensorRT‑LLM) or contributions to open‑source inference engines (vLLM, TensorRT‑LLM, SGLang, TGI)
- Knowledge of Kubernetes, service meshes, API gateways, or distributed scheduling
- Background in developer‑facing infrastructure or open‑source APIs
- Infra‑leaning generalists with strong engineering fundamentals; ML experience is a plus but not required

Benefits

- Competitive compensation package
- Opportunity to be part of a rapidly growing startup in AI engineering
- Inclusive and supportive work culture fostering learning and growth
- Exposure to a variety of ML startups for learning and networking

Apply now to embark on a rewarding journey in shaping the future of AI. We value diverse and inclusive workplaces and provide equal employment opportunities.
