Sixtyfour
Infrastructure Engineer — Systems & Platform
Sixtyfour, San Francisco, California, United States, 94199
What You’ll Do
Design and maintain highly available, scalable infrastructure across AWS (ECS, EKS, Lambda, SQS, CloudFront, CloudWatch).
Architect automated CI/CD pipelines (GitHub Actions, Terraform) with strong testing, observability, and rollback safety.
Optimize LLM inference infrastructure, including autoscaling GPU/CPU clusters, caching, async queues, batching, and tracing.
Improve deployment workflows and environment consistency using Docker, IaC, and lightweight configuration management.
Work on backend performance, including queue throughput, caching strategies, database indexing, and load balancing.
Monitor, debug, and improve system reliability and latency across all services (API, inference, and web app).
Build internal tools that enhance developer productivity and operational visibility.
Partner with engineers to evolve the workflow and job execution engine for better parallelism, retry logic, and observability.
Set up metrics, tracing, and alerting (OpenTelemetry, Prometheus, Grafana, Sentry) to make reliability measurable and actionable.
Minimum Requirements
Strong experience with cloud infrastructure (AWS preferred) including EC2, ECS, EKS, Lambda, S3, VPCs, networking, and IAM.
Proficiency with Docker and CI/CD tools such as GitHub Actions or CircleCI.
Experience scaling Python backend systems and modern web APIs (FastAPI preferred).
Hands-on experience with API servers and background workers (Celery, Redis queues, etc.).
Comfort with Postgres and Redis, including schema design, caching, rate limiting, and locks.
Strong observability mindset, including logs, metrics, and traces.
Production experience with autoscaling, load testing, and cost-aware resource optimization.
Excellent debugging and on-call discipline with a focus on uptime and reliability.
Nice to Have
Experience managing LLM serving infrastructure (OpenAI-compatible APIs, vLLM, Triton, or similar).
Familiarity with Next.js and TypeScript to understand end-to-end deployment pipelines.
Experience with Terraform, Pulumi, or similar IaC tools.
Security-focused mindset, including network boundaries, secret management, and RBAC.
Knowledge of real-time systems (SSE or WebSockets) or stream processing.
Experience building developer platform tools or internal DevOps systems.