T-Mobile

Site Reliability Engineer

T-Mobile, Frisco, Texas, United States, 75034

At T-Mobile, we invest in YOU! Our Total Rewards Package ensures that employees get the same big love we give our customers. All team members receive a competitive base salary and compensation package – this is Total Rewards. Employees enjoy multiple wealth‑building opportunities through our annual stock grant, employee stock purchase plan, 401(k), and access to free, year‑round money coaches. That’s how we’re UNSTOPPABLE for our employees! Are you ready to join the Un‑carrier movement?

The Site Reliability Engineer at T-Mobile is instrumental in enhancing system reliability and resilience, ensuring our digital infrastructure operates seamlessly. By automating processes and reducing manual effort, they minimize operational incidents and streamline software development and deployment. Their proficiency in programming, scripting languages, and incident response management fortifies our systems against disruptions. Through continuous learning and adaptation to new technologies, they drive innovation and maintain system robustness. Their contributions are vital to the stability and performance of T-Mobile’s digital operations, directly impacting our service quality and operational efficiency.

Job Responsibilities:

Automates processes to enhance system reliability and resilience:

Own the reliability, scalability, and uptime of business‑critical systems and services.

Build self‑healing systems that automatically detect and recover from faults.

Implement redundancy, failover, and chaos testing strategies to validate system resilience.

Apply and refine SLOs/SLIs to measure and improve reliability outcomes.

Develop runbooks, automation scripts, and incident workflows that improve recovery time.

Integrate reliability best practices into the software development lifecycle.

Collaborate closely with Platform, Cloud, and Development teams to design for reliability.

Minimizes operational incidents through proactive monitoring and maintenance:

Maintain and improve service‑level objectives (SLOs) and error budgets for production systems.

Develop and enhance observability platforms (Prometheus, Grafana, CloudWatch, Azure Monitor) for deep visibility into system health.

Build and tune alerting systems to detect anomalies before they impact users.

Analyze system metrics, logs, and traces to identify performance bottlenecks.

Lead capacity planning and performance optimization efforts across multiple environments.

Streamlines software development and deployment processes:

Partner with operations and development teams to ensure service stability across hybrid infrastructure.

Design and implement automation for infrastructure provisioning and environment management using Terraform, Ansible, and Python.

Optimize CI/CD pipelines (GitLab CI/CD, Jenkins) to accelerate deployments while maintaining stability.

Develops scripts and tools to reduce manual efforts in operational tasks:

Maintain reusable automation modules and operational tooling for reliability improvements.

Manages incident response to ensure rapid recovery and minimal disruption:

Lead incident response and root cause analysis to minimize downtime and prevent recurrence.

Contribute to post‑incident reviews and reliability roadmaps to drive continuous improvement.

Adapts to new technologies to maintain and enhance system robustness:

Stay current with new SRE methodologies, tools, and infrastructure technologies.

Education and Work Experience:

Bachelor’s Degree in Computer Science or Engineering (Required)

Master’s or Advanced Degree in Computer Science or Data Science (Preferred)

2–4 years developing and maintaining CI/CD pipelines for software deployment (Required)

2–4 years implementing and managing cloud‑native platforms and solutions (Required)

2–4 years guiding and mentoring teams in reliability engineering practices (Required)

Knowledge, Skills and Abilities:

Problem Solving: Ability to identify, analyze, and resolve system reliability issues. (Required)

Scripting Languages: Proficiency in scripting languages such as Python or Bash to automate tasks and processes. (Required)

Incident Response Management: Skilled in managing and responding to system incidents to minimize downtime and impact. (Required)

Licenses and Certifications:

Certified Kubernetes Administrator (CKA) – validates Kubernetes expertise. (Preferred)

AWS Certified DevOps Engineer – demonstrates expertise in provisioning, operating, and managing distributed application systems on AWS. (Preferred)

Site Reliability Engineering (SRE) Foundation Certification – foundational understanding of the SRE philosophy, practices, and tools. (Preferred)

Other Requirements:

At least 18 years of age

Legally authorized to work in the United States

T‑Mobile USA, Inc. is an Equal Opportunity Employer. All decisions concerning the employment relationship will be made without regard to age, race, ethnicity, color, religion, creed, sex, sexual orientation, gender identity or expression, national origin, religious affiliation, marital status, citizenship status, veteran status, the presence of any physical or mental disability, or any other status or characteristic protected by federal, state, or local law. Discrimination, retaliation or harassment based upon any of these factors is wholly inconsistent with how we do business and will not be tolerated.

If you are an individual with a disability and need reasonable accommodation at any point in the application or interview process, please let us know by emailing ApplicantAccommodation@t-mobile.com or calling 1-844-873-9500. This contact channel is not a means to apply for or inquire about a position, and we are unable to respond to requests unrelated to accommodation.
