innovitusa
Hiring: W2 Candidates Only
Visa: Open to any visa type with valid work authorization in the USA
We are seeking a highly skilled Site Reliability Engineer (SRE) to build, scale and maintain our production infrastructure. The ideal candidate blends software engineering expertise with strong operational discipline. You will ensure the reliability, availability, security and performance of our cloud-based systems while driving automation and continuous improvement across engineering teams.
Key Responsibilities
Design, build and manage highly scalable and reliable infrastructure across cloud environments (AWS / Azure / GCP).
Develop automation for deployment, monitoring, scaling and recovery using tools such as Terraform, Ansible, Helm or CloudFormation.
Implement CI / CD pipelines and partner with development teams to enhance deployment velocity and operational stability.
Monitor system performance using tools like Prometheus, Grafana, Datadog, ELK Stack or CloudWatch.
Perform incident response, root cause analysis (RCA) and postmortems to ensure continuous improvement.
Build and maintain robust alerting systems and SLOs / SLIs to uphold service-level reliability targets.
Improve system resilience through capacity planning, chaos engineering, fault‑tolerance testing and disaster recovery strategies.
Maintain and enhance security posture, ensure compliance and enforce operational best practices.
Manage containers and orchestration platforms such as Docker and Kubernetes at scale.
Collaborate with cross‑functional teams to drive reliability, performance tuning and cost optimization.
Required Skills & Qualifications
Bachelor's degree in Computer Science, Engineering or a related technical field.
4‑8 years of SRE, DevOps or Cloud Engineering experience.
Strong proficiency in cloud platforms: AWS, Azure or GCP.
Expertise with infrastructure‑as‑code tools (Terraform, CloudFormation, Pulumi, Ansible).
Hands‑on experience with Kubernetes, Docker and container orchestration.
Strong scripting / programming skills in Python, Go, Bash or similar.
Solid understanding of networking fundamentals (DNS, TCP/IP, Load Balancing, VPC).
Experience with monitoring, log management and observability tools.
Strong problem‑solving, debugging and troubleshooting skills in large‑scale distributed systems.
Good communication skills and ability to work in fast‑paced collaborative environments.
Preferred Qualifications
Experience supporting microservices‑based architectures.
Knowledge of serverless technologies (Lambda, GCP Cloud Functions, Azure Functions).
Experience with GitOps tools (ArgoCD, Flux).
Background in security hardening, compliance or cloud architecture.
Familiarity with chaos engineering tools (Gremlin, LitmusChaos).
Experience in on‑call rotations with strong incident management skills.
Key Skills: Kubernetes, FMEA, Continuous Improvement, Elasticsearch, Go, Root Cause Analysis, Maximo, CMMS, Maintenance, Mechanical Engineering, Manufacturing, Troubleshooting