DRC Systems
Job Description
The Site Reliability Engineer (SRE) role bridges software engineering and systems administration. Beyond ensuring the reliability and performance of platforms, the role involves working with Development and Architecture teams on quality gates, foundational architecture and stack components, metrics, trackers, baselines, and automated operations. Location: Dallas, Texas (Hybrid). Duration: Full‑time. Experience requirement: 10+ years.
Key Responsibilities
Automation: Automate tasks (scripts, triggers, workflow automations) for deployment, monitoring, and incident response to improve efficiency.
Monitoring and Observability: Design instrumentation and define KPIs/metrics, events, and alerts to track system health and preempt issues.
Incident Response: Respond to and resolve incidents exceeding L1/L2 thresholds, coordinate with L3 teams, minimize downtime, and follow up on problem backlogs and shift‑left initiatives.
Infrastructure as Code: Use Terraform, Ansible, or similar tools to manage infrastructure as code for repeatable, scalable deployments.
Collaboration: Work closely with architecture, development, QA, testing, and operations teams to understand system requirements and enhance overall resilience.
Problem‑Solving: Apply strong analytical skills to diagnose and resolve complex issues.
Communication: Translate technical details into actionable insights for both technical and non‑technical stakeholders.
Soft Skills: Demonstrate teamwork, time management, and proactive problem identification.
Technical Skills
Programming: Python, Java, C/C++, or Ruby, and IaC languages (Ansible, Terraform, Cloud‑Native).
Cloud Platforms: AWS, Azure, or GCP.
Containerization: Docker and Kubernetes.
Networking and System Administration.
CI/CD: Jenkins, Harness, or Spinnaker.
Qualifications
10+ years of relevant experience.
Mid‑Senior level; Full‑time commitment.