TalentOla
Location: Sunnyvale, CA (Local candidate)
Responsibilities
Ensure system reliability and availability – Monitor system issues, create strategies to detect and address them, design automated troubleshooting systems, and write and review post-mortems.
Mitigate operational risks – Collaborate with development teams and other stakeholders to identify potential risks, perform risk assessments, implement risk mitigation strategies, and continuously monitor and review the effectiveness of those strategies.
Monitor system health.
Drive continuous improvement by collaborating with various teams.
Automate processes.
Must have / required experience and skills
12+ years of experience in DevOps and Site Reliability Engineering.
Hands-on with containerization and orchestration: Docker, Kubernetes/EKS.
Proficiency in infrastructure as code tools: Terraform, Ansible, or CloudFormation.
Experience setting up and managing services running on Kubernetes.
In-depth understanding of SRE principles including monitoring, alerting, error budgets, fault analysis, and automation.
In-depth knowledge of monitoring and observability tools: Splunk.
Knowledge of Linux operating system principles, networking fundamentals, and systems management.
Demonstrable fluency in at least one of the following languages: Java or Python.
Ability to identify and communicate technical and architectural problems while working with partners and their teams to iteratively find solutions.
Building and managing CI/CD pipelines – gatekeeping production deployments, developing and implementing Git branching strategies, branch protection rules, and network policies, and scaling workloads up and down on AWS.
Strong problem-solving and analytical skills.
Ability to solve performance and scalability issues in the system.
Technical Skills
DevOps and SRE
Terraform, Ansible, or CloudFormation
Programming/Scripting using Java or Python
CI/CD
Behavioural Skills
Excellent communication and collaboration skills
Ability to propose and implement improvements in the system
Ability to work with cross-functional stakeholders
Adaptability and a willingness to learn new technologies and techniques
Proactive approach to issues, with the ability to provide prompt resolutions or workarounds
Seniority level
Mid-Senior level
Employment type
Contract
Job function
Information Technology