National Oilwell Varco

Site Reliability Engineer

National Oilwell Varco, Houston, Texas, United States, 77246

As a Site Reliability Engineer, you will be responsible for:

Operational Excellence & Incident Management
- Maintain and monitor production systems for availability, latency, and performance.
- Lead incident response efforts, including communication, resolution, and postmortem documentation.
- Design and implement health checks, alerting systems, and automated remediation workflows.
- Drive root cause analysis and implement permanent resolutions for recurring issues.

Observability & Insights
- Set up and maintain full observability stacks (logging, metrics, tracing) using tools like Prometheus, Grafana, Datadog, OpenTelemetry, or ELK.
- Analyze telemetry and logs to identify trends, anomalies, and opportunities for improvement.
- Conduct post-incident reviews and use insights to inform future engineering investments.

Performance & Systems Optimization
- Tune and optimize distributed systems, including AKKA.NET actors, for performance and resource efficiency.
- Work with developers to evolve architecture and improve system throughput, latency, and stability.
- Optimize PostgreSQL performance, queries, and maintenance strategies.

CI/CD & Automation
- Design and maintain modern CI/CD pipelines using GitHub Actions, Azure Pipelines, or GitLab CI.
- Automate deployment, testing, and rollback processes to reduce friction and increase deployment frequency.
- Standardize infrastructure-as-code practices across environments.

Education and Experience
- 5+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.
- Bachelor’s degree in Information Technology, Computer Science, or a related field.
- Expertise in Kubernetes and container orchestration at scale.
- Strong experience with AKKA.NET or similar actor-based frameworks.
- Proficiency with scripting and automation (Bash, PowerShell, Python).
- Experience with observability tools (Phobos, Datadog, Prometheus, Grafana, OpenTelemetry, ELK).
- Hands-on experience with cloud platforms (AWS, Azure, or GCP).
- Strong PostgreSQL knowledge: performance tuning, query optimization, maintenance.
- Proven ability to lead incident management and drive postmortem processes.
- A builder’s mindset with high standards for operational excellence and technical ownership.

Preferred Tools & Ecosystem Experience
- CI/CD: GitHub Actions, Azure Pipelines, GitLab CI
- Infrastructure: Kubernetes, Docker, Terraform
- Monitoring: Phobos (AKKA.NET), Datadog, Prometheus
- Source Control: GitHub, GitLab, Azure DevOps
- Programming: C#, Python, Bash, PowerShell