Traversal

AI Engineer - Infrastructure

Traversal, New York, New York, US, 10261

Overview

Traversal is the AI Site Reliability Engineer (SRE) for the enterprise, trusted to troubleshoot, remediate, and prevent complex production incidents. Our mission is to free engineers from firefighting and enable them to focus on high-impact work. We are an AI agent lab for the enterprise, with a strong AI research background.

The Role: As an AI Infrastructure Engineer on the Platform / Reliability team, you will design, secure, and operate the core systems that power Traversal’s AI products. We serve Fortune 50 enterprises with multi-tenancy and SOC 2 Type II controls, and we are rapidly scaling.

Responsibilities

System Design & Architecture: Design scalable, reliable infrastructure for AI inference, data pipelines, and agentic workflows.
Queue & Job Scheduling (K8s-native): Migrate from Python multiprocessing + Postgres-as-queue to Kubernetes-native queuing and orchestration (KEDA/HPA, Jobs/CronJobs, Kueue/Argo).
Managed Kafka Operations: Tune partitioning and throughput, design DLQ + replay runbooks, implement idempotent sinks to avoid duplicates.
Autoscaling: Scale on real signals (queue lag, in-flight requests, latency); add burst capacity and safe drains.
Per-Tool Reliability: Productionize MCP toolchains with circuit breaking, timeouts, sandboxing, and auditability.
Progressive Delivery: Implement canary and blue/green rollouts for stateful services, pre-warm caches/weights, and enable graceful termination.
Observability: Build RED/USE dashboards and OpenTelemetry traces across gateway → agent → tool → Kafka → sinks.
Infrastructure as Code: Evolve Terraform/Helm/Kustomize for multi-environment deployments, secrets, policy-as-code (OPA/Rego), and workload identity.

Qualifications

3+ years of experience at technically rigorous companies or teams.
Proven experience operating high-concurrency backends with managed Kafka fan-in/out and at-least-once processing.
Experience designing idempotent systems (outbox, dedupe keys, safe replay).
Production experience building and maintaining systems in Python and Rust (Rust 2024).
Experience with incident response, chaos testing, and capacity planning.
Familiarity with AWS, EKS, Terraform, Helm/Kustomize.
Strong debugging skills across runtime, Kafka, network, and auth layers.
Security-minded, with experience implementing least privilege, default-deny egress, auditability, and policy-as-code.

Nice to Have

GPU workload operations (MIG, topology-aware placement), inference servers, token streaming gateways.
Data governance (PII discovery/redaction), lineage, tokenization.
Cross-region active/active for Kafka and stateless services.
Service mesh (Envoy/Istio), Cilium/eBPF, ClickHouse for analytics.

Compensation & Benefits

We offer competitive compensation, startup equity, health insurance, and additional benefits. The U.S. base salary range for this full-time, in-person role in New York is $150,000–$300,000, plus equity and benefits. Salaries are based on location, level, and role; individual compensation is determined by experience, skills, and job-related knowledge.

Why You Should Join Us

We provide health insurance, a great tech setup, flexible time off, and in-office amenities. We offer competitive salary and equity packages, and a thoughtful, high-impact team environment. Traversal operates in-office, based in New York near Madison Square Park. Joining us means owning meaningful parts of the product, moving fast, and learning constantly. It is a place to grow your career and help define a new category of infrastructure software.
