Tandym Tech
Cloud SRE Lead – Major Incident and Digital Transformation
Tandym Tech, Reston, Virginia, United States, 22090
A recognized services company is actively seeking a new Cloud SRE Lead to join their team. In this role, the Cloud SRE Lead will be responsible for ensuring the reliability, scalability, and performance of the company’s cloud infrastructure on Amazon Web Services (AWS) and for guiding the daily activities of the SRE team.
About the Opportunity:
Must be able to obtain and maintain the required agency clearance (6C Public Trust)
Responsibilities:
Run ideation sessions across multiple teams and companies to identify improvements, including radical changes, to the current incident management process
Review currently available tools and industry best-of-breed options to recommend and champion the right tools, technologies, and capabilities to empower, visualize, communicate, and activate cross-functional teams
Coordinate and lead Major Incidents by directing troubleshooting, communicating status, encouraging action, guiding the use of tools, and ensuring swift and complete resolution
Schedule and lead blameless postmortems encouraging independent ideas, identification of true root causes, and communication of findings
Design, implement, and manage infrastructure as code (IaC) solutions using tools like AWS CloudFormation, Terraform or Helm Charts to automate deployment and scaling processes
Implement robust monitoring and alerting systems to proactively identify and address potential issues before they impact system performance
Conduct performance analysis and optimization of AWS infrastructure components to enhance system efficiency and reduce latency
Participate in on-call rotations to respond to and resolve incidents promptly
Work closely with security teams to implement and enforce best practices for securing AWS environments
Facilitate clear communication across teams, providing updates on release status, known issues, and any potential impact on stakeholders
Collaborate with development, QA, and operations teams to plan and coordinate software releases
Develop and maintain automated deployment pipelines using industry-standard tools such as AWS CI/CD, GitLab CI/CD, Jenkins, or similar
Qualifications:
5+ years of related work experience
Bachelor’s Degree
Proven experience as a Site Reliability Engineer or similar role
In-depth knowledge of AWS services and expertise in managing cloud infrastructure
Proven experience in a Digital Transformation role
Advanced-level programming and/or scripting in 3 or more of the following languages and tools: Python, Java, Chef, Helm, Playwright, Bash, JavaScript, Terraform
Strong understanding of DevOps principles and continuous integration/continuous deployment (CI/CD) pipelines
Proficiency in CI/CD tools such as AWS CI/CD, GitLab CI/CD, or others
Familiarity with infrastructure as code (IaC) tools like CloudFormation, Terraform, Helm Charts, Morpheus, or similar technologies
Hands-on experience with version control systems (GitLab, AWS CodeCommit, SVN) and branching strategies
Experience with containerization and orchestration tools (e.g., Amazon Elastic Container Service (ECS), Amazon Elastic Kubernetes Service (EKS), Docker, Kubernetes)
Familiarity with monitoring tools (e.g., CloudWatch, Prometheus, Grafana, Datadog, Dynatrace) and log analysis
Solid understanding of Agile methodologies and their application in release management and Cloud operations
Excellent problem-solving and troubleshooting skills
Strong communication and collaboration skills
Desired Skills:
3+ years in an SRE or Platform Engineering group supporting high-availability or critical platforms and applications
Relevant certifications in DevOps or related fields
High Risk Public Trust or Secret Clearance
Experience managing a distributed container platform, including but not limited to deployment/release management, provisioning, capacity management, and workload management