Cox Automotive Inc.
Lead Site Reliability Engineer – Cox Automotive Inc.
The Lead Site Reliability Engineer will be part of the Site Reliability Engineering (SRE) team. The SRE team drives reliability, observability, and engineering practice maturity across more than 150 teams, made up of over a thousand engineers, in our part of Cox Automotive. We build processes, documentation, and tools that scale: deep observability to detect and diagnose issues faster, engineering maturity assessments that drive measurable improvement, reusable golden paths that accelerate delivery, and trusted advisory relationships that align reliability with business priorities. Much of our work focuses on eliminating toil through automation and establishing self-service capabilities that multiply our impact.
Responsibilities
Define and drive adoption of SLIs, SLOs, error budgets, and high-quality alerting standards across the organization
Architect end-to-end observability strategies (metrics, logs, traces, business signals) with consistent taxonomy and discoverability
Build centralized dashboards, reliability scorecards, and runbooks used by engineering teams and leadership
Establish engineering practice maturity baselines and partner with teams on measurable improvement plans
Create golden paths (standardized pipelines, infrastructure modules, and service templates) that enable rapid, consistent delivery
Lead internal workshops, game days, and learning programs to spread operational excellence
Act as a trusted advisor to product and engineering leadership, providing data-driven insights on reliability risk and trade-offs
Guide post-incident reviews toward systemic remediation (guardrails, automation, design changes) rather than superficial fixes
Design and extend self-service platforms for deployment, progressive delivery, and automated recovery
Reduce MTTR through better telemetry, automation, and resilience patterns
Mentor engineers across teams to become local reliability champions, scaling SRE impact without adding headcount
Qualifications
Experience programming in at least one of the following languages: Python, TypeScript, or Java.
Bachelor's degree in a related discipline and 6 years' experience in a related field. The right candidate could also have a different combination, such as a master's degree and 4 years' experience; a Ph.D. and 1 year of experience; or 18 years' experience in a related field.
Applicants must currently be authorized to work in the United States for any employer without current or future sponsorship. No OPT, CPT, STEM/OPT, or visa sponsorship now or in the future.
Expertise in designing, analyzing, and troubleshooting large-scale distributed systems.
Deep hands‑on experience with modern observability tools (CloudWatch and New Relic).
Proven ability to assess engineering practices and drive measurable improvements across multiple teams.
Experience establishing SLIs/SLOs, managing error budgets, and improving alert signal‑to‑noise ratios.
Strong background in release engineering, CI/CD, and progressive deployment strategies.
Deep expertise in AWS, Terraform, AWS CDK, and GitHub/GitHub Actions.
Track record reducing MTTR and improving availability through automation and architectural improvements.
Excellent written and verbal communication skills tailored to both engineers and executives.
Systematic problem‑solving approach with a sense of drive and ownership.
Understanding of Linux operating systems, networking, and performance fundamentals.
Ability to build trust and influence decisions through data‑driven insights.
Experience facilitating effective post‑incident analysis and driving systemic remediation.
Desire to work in a fast‑paced, evolving, growing, dynamic environment.
Compensation
Base salary: $119,600.00 – $199,400.00 per year. The base salary may vary within the anticipated range based on location and the selected candidate's qualifications. The position may be eligible for additional compensation that may include an incentive program.
Benefits
Paid vacation as deemed consistent with duties, the company's needs, and its obligations.
Seven paid holidays each year.
Up to 160 hours of paid wellness annually for employee or family members.
Additional paid time off: bereavement leave, leave to vote, jury duty leave, volunteer time off, military leave, and parental leave.