TalentOla
Location: Sunnyvale, CA (Local candidate)
Responsibilities
- Ensure system reliability and availability: monitor system issues, create strategies to detect and address them, design automated troubleshooting systems, and write and review post-mortems.
- Mitigate operational risks: collaborate with development teams and other stakeholders to identify potential risks, perform risk assessments, implement risk mitigation strategies, and continuously monitor and review their effectiveness.
- Monitor system health.
- Drive continuous improvement by collaborating with various teams.
- Automate processes.
Must have / required experience and skills
- 12+ years of experience in DevOps and Site Reliability Engineering.
- Hands-on with containerization and orchestration: Docker, Kubernetes/EKS.
- Proficiency in infrastructure as code tools: Terraform, Ansible, or CloudFormation.
- Experience setting up and managing services running on Kubernetes.
- In-depth understanding of SRE principles including monitoring, alerting, error budgets, fault analysis, and automation.
- In-depth knowledge of monitoring and observability tools: Splunk.
- Knowledge of Linux operating system principles, networking fundamentals, and systems management.
- Demonstrable fluency in at least one of the following languages: Java or Python.
- Ability to identify and communicate technical and architectural problems, while working with partners and their team to iteratively find solutions.
- Building and managing CI/CD pipelines: gatekeeping production deployments, developing and implementing Git branching strategies, branch protection rules, and network policies, and scaling workloads up and down on AWS.
- Strong problem-solving and analytical skills.
- Ability to resolve performance and scalability issues in the system.
Technical Skills
- DevOps and SRE
- Terraform, Ansible, or CloudFormation
- Programming/Scripting using Java or Python
- CI/CD
Behavioural Skills
- Excellent communication and collaboration skills
- Ability to propose and implement improvements in the system
- Ability to work with cross-functional stakeholders
- Adaptability and a willingness to learn new technologies and techniques
- Proactive approach to issues, with the ability to provide prompt resolutions or workarounds
Seniority level
- Mid-Senior level
Employment type
- Contract
Job function
- Information Technology