Traversal
Overview
About Traversal
Traversal is the AI Site Reliability Engineer (SRE) for the enterprise, trusted by some of the world's largest companies to troubleshoot, remediate, and prevent complex production incidents. Our mission is to free engineers from firefighting and enable them to focus on creative, high-impact work. We are building the premier AI agent lab for the enterprise, combining AI research rigor with real-world production experience. We are assembling a talented team from institutions and companies including MIT, Harvard, Berkeley, Citadel Securities, Cockroach Labs, Cerebras Systems, Glean, Nuro, Perplexity, Pinecone, and more. This role reports into our growing AI Platform and SRE efforts.
The Role
As an AI Site Reliability Researcher, you'll play a central role in ensuring the scalability, reliability, and observability of our AI platform. This is a high-impact, cross-functional role where you'll design systems and processes that keep our AI-driven infrastructure healthy and performant. We're entering a phase of rapid growth and scale, driven by the needs of large enterprise customers. That means pressure on deployments and developer workflows. We're building our own distributed systems, maturing our CI/CD pipelines, and managing complex hybrid environments (SaaS and on-prem). You'll play a foundational role in establishing SRE practices that let us scale thoughtfully and reliably. In this role, you'll define how we do change management across diverse deployment environments, build internal observability from the ground up, and help bring structure to systems that are evolving quickly. You'll also be a hands-on user of Traversal, and your feedback will shape the product directly. You'll collaborate closely with our infra and AI agent teams, with opportunities to influence how AI integrates with real-world production environments.
Responsibilities
Brains of the Product: Distill SRE knowledge into agentic workflows.
System Design & Architecture: Build scalable and resilient infrastructure to support AI observability agents in both cloud and on-prem environments.
Observability: Build systems to monitor logs, metrics, and traces tied to deployments and developer activity. Be a power user of observability tools.
Incident Management: Define and lead on-call and incident response processes, including alerting, debugging, and postmortems.
CI/CD & Deployment: Design and scale our in-house CI/CD systems to support safe, efficient rollouts across hybrid environments.
Infrastructure Automation: Own our infrastructure-as-code stack and improve automation across deployment and provisioning workflows.
Qualifications
Experience as an SRE, infrastructure engineer, or similar role in fast-paced environments.
Exceptional debugging skills across complex, distributed systems, with a proven ability to get to root cause quickly across varied tech stacks.
Strong systems design intuition: an understanding of how observability tools fit into architecture and how to use them effectively in incident response.
Experience with observability tools (e.g., Datadog, Grafana, Prometheus, OpenTelemetry) and incident response.
Deep understanding of infrastructure automation and CI/CD systems.
Hands-on experience with Terraform, Kubernetes, and cloud environments (AWS or GCP).
Ability to debug distributed systems and drive system-level improvements.
Experience supporting hybrid cloud/on-prem deployments and complex change management.
Nice to Have
Familiarity with AI infrastructure or supporting ML/LLM workloads in production.
Background in developer productivity tooling or internal platform teams.
Prior experience building systems that connect infra events to developer workflows.
Exposure to agentic systems or AI observability platforms.
Compensation
We offer competitive compensation, startup equity, health insurance, and additional benefits. The U.S. base salary range for this full-time, in-person role in New York is $150,000–$300,000, plus equity and benefits. Our salary ranges are based on location, level, and role. Individual compensation is determined by experience, skills, and job-related knowledge.
Why You Should Join Us
We'll make sure you're fully supported with health insurance, a great tech setup, flexible time off, and plenty of in-office snacks. We offer competitive salary and equity packages, and thoughtful consideration with every hire on our small, high-impact team. Traversal is fully in-office, 5 days a week, based in New York near Madison Square Park. We have a collaborative, hard-working culture and are energized by building the future of AI-powered software maintenance. Working here means owning meaningful parts of the product, having the flexibility to move fast, and learning constantly. This is a place to grow your career, make a real impact, and help define a new category of infrastructure software.