Request Technology, LLC

Site Reliability Engineer

Request Technology, LLC, Work From Home

Full-time, hybrid (3 days onsite, 2 days remote). No visa sponsorship. Base pay: $150,000 – $155,000 per year, depending on skills and experience.

A prestigious company seeks a Site Reliability Engineer focused on observability, logging, and capacity planning. The role requires experience with Linux, Kubernetes/Docker, Terraform, Jenkins, Ansible, Harness, and Kafka.

Responsibilities

  • Collaborate with development, operations and infrastructure teams to ensure service availability and work through implementation issues
  • Develop automation to respond to incidents and prevent problem recurrence
  • Create and enhance runbooks to respond to service outages or degradations
  • Assess the production readiness of services
  • Define and track operational metrics for production performance, reliability, scalability and availability
  • Architect, develop and maintain shared services and tools to improve reliability and reduce toil across the organization

Qualifications

  • Bachelor’s or Master’s degree in Computer Science, Information Systems or a related field, or equivalent work experience
  • Minimum of 4 years of experience in Site Reliability Engineering / DevOps
  • Experience with maintaining and troubleshooting large‑scale distributed systems
  • Experience managing infrastructure in public cloud environments like AWS (preferred), Azure or GCP
  • Experience with AIOps and predictive analysis for anomaly detection and system capacity forecasting, using monitoring and alerting tools like Splunk, AppDynamics, Datadog, StackDriver, Sysdig, Prometheus or Grafana
  • Programming/scripting experience in languages like Java, Bash, Python or Go
  • Experience with distributed messaging systems such as Kafka, RabbitMQ, or ActiveMQ
  • Experience with container orchestration systems such as Kubernetes, Mesos, Docker Swarm or Rancher
  • Experience with CI/CD tools such as Jenkins, Travis, Harness, Appveyor, CodeBuild or CodePipeline
  • Familiarity with leveraging large language models (LLMs) to automate and optimize SRE workflows, including scripting, incident report summarization, or AI workload maintenance

Seniority Level

Mid‑Senior
