Request Technology, LLC
Site Reliability Engineer
Hybrid (3 days onsite, 2 days remote) full‑time. No visa sponsorship. Base pay: $150,000 – $155,000 per year, subject to skills and experience.
A prestigious company seeks a Site Reliability Engineer focused on observability, logging, and capacity planning. The role requires experience with Linux, Kubernetes/Docker, Terraform, Jenkins, Ansible, Harness, and Kafka.
Responsibilities
- Collaborate with development, operations and infrastructure teams to ensure service availability and work through implementation issues
- Develop automation to respond to incidents and prevent problem recurrence
- Create and enhance runbooks to respond to service outages or degradations
- Assess the production readiness of services
- Define and track operational metrics for production performance, reliability, scalability and availability
- Architect, develop and maintain shared services and tools to improve reliability and reduce toil across the organization
Qualifications
- Bachelor’s or Master’s degree in Computer Science, Information Systems or a related field, or equivalent work experience
- Minimum of 4 years of experience in Site Reliability Engineering / DevOps
- Experience with maintaining and troubleshooting large‑scale distributed systems
- Experience managing infrastructure in public cloud environments like AWS (preferred), Azure or GCP
- Experience with AIOps and predictive analysis for anomaly detection and system capacity forecasting, using monitoring and alerting tools like Splunk, AppDynamics, Datadog, Stackdriver, Sysdig, Prometheus or Grafana
- Programming/scripting experience in languages like Java, Bash, Python or Go
- Experience with distributed messaging systems such as Kafka, RabbitMQ, or ActiveMQ
- Experience with container orchestration systems such as Kubernetes, Mesos, Docker Swarm or Rancher
- Experience with CI/CD tools such as Jenkins, Travis CI, Harness, AppVeyor, CodeBuild or CodePipeline
- Familiarity with leveraging large language models (LLMs) to automate and optimize SRE workflows, including scripting, incident report summarization, or AI workload maintenance
Seniority Level
Mid‑Senior