Entomo

Sr. Site Reliability Engineer

Entomo, Houston, Texas, United States, 77246

Join us to build enterprises of tomorrow

Job Description

We are seeking a skilled Site Reliability Engineer (SRE) to join our team. In this role, you will be responsible for bridging the gap between development and operations by applying software engineering principles to infrastructure and operations tasks. Your primary focus will be ensuring the reliability, availability, performance, and scalability of our production systems while minimizing manual operational work through automation and enhancing system resilience.

Position Overview

The Site Reliability Engineer will work closely with development and operations teams to design, implement, and maintain highly reliable systems. You will be instrumental in establishing best practices for observability, incident response, and infrastructure management. Your expertise will help reduce operational overhead, improve system performance, and ensure seamless deployments through CI/CD pipelines.

Qualifications

Required Skills and Experience

Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience

3+ years of experience in SRE, DevOps, or similar roles

Strong proficiency with Kubernetes (K8s) and Docker containerization

Experience with the ELK stack (Elasticsearch, Logstash, Kibana) for logging and monitoring

Good to have: Understanding of Java programming and Java application troubleshooting

Working knowledge of SQL and MongoDB databases

Familiarity with Angular for frontend monitoring and diagnostic tooling

Strong understanding of system architecture, cloud infrastructure, and networking

Experience with Infrastructure as Code (IaC) tools (e.g., Terraform, Ansible)

Experience with monitoring and observability platforms

Excellent problem-solving skills and ability to troubleshoot complex systems

Strong verbal and written communication skills

Preferred Skills

Must have: Experience with AWS public cloud

Good to have: Knowledge of Azure and GCP

Familiarity with CI/CD tools (Jenkins, GitLab CI, GitHub Actions)

Understanding of service mesh technologies (e.g., Istio)

Experience with scripting languages (Python, Bash)

Understanding of distributed systems and microservices

Experience implementing SLOs, SLIs, and SLAs

Awareness of security best practices

Certifications in relevant technologies (e.g., CKA, AWS Certified)

Roles and Responsibilities

System Reliability and Performance

Design, implement, and maintain highly available and scalable infrastructure

Define and track SLOs, SLIs, and error budgets

Conduct capacity planning and optimize performance

Improve system resilience and fault tolerance

Perform regular health checks and proactive maintenance

Monitoring and Observability

Deploy and maintain monitoring solutions (e.g., ELK stack)

Build dashboards for system metrics, logs, and app performance

Set up alerting systems tuned to reduce alert fatigue

Implement distributed tracing and ensure service telemetry

Maintain comprehensive logging across systems

Incident Management and Response

Lead incident response, including mitigation and resolution

Conduct root cause analysis and post-incident reviews

Maintain incident runbooks and knowledge base

Participate in on‑call rotation for critical systems

Automation and Toil Reduction

Identify and automate repetitive operational tasks

Implement Infrastructure as Code for consistent provisioning

Automate testing and deployment processes

Design and maintain reliable CI/CD pipelines

Implement automated testing within workflows

Support canary deployments, feature flagging, and rollback strategies

Infrastructure Management

Manage Kubernetes clusters and containerized applications

Oversee config management and version control

Implement infrastructure security and compliance

Optimize resources and ensure backup/disaster recovery

Collaboration and Knowledge Sharing

Partner with development teams to enhance reliability

Provide architectural guidance with an SRE lens

Conduct documentation and knowledge‑sharing sessions

Promote SRE best practices across the organization

Collaborative, improvement‑driven team culture

Exposure to cutting‑edge technologies

Balance of project and operational responsibilities

Focus on automation, innovation, and resilience

Strong emphasis on learning and growth

Success Metrics

Improved system availability and reliability

Reduction in MTTD and MTTR

Fewer production incidents and outages

Increased automation and reduced manual effort

Successful SLO implementation and monitoring coverage

Positive feedback from dev teams on SRE support

Transform people experience in your enterprise of tomorrow
