Sibitalent Corp
Senior Site Reliability Engineer
Sibitalent Corp, San Francisco, California, United States, 94199
Job Title:
Staff Site Reliability Engineer (SRE)
Location:
San Francisco, CA (Hybrid, Local Only)
Duration:
6+ months Contract
Profile:
12+ Years of experience
Employment Type:
W2 or C2C (either will work)
Job Description
As our Staff SRE, you'll be the primary expert responsible for our entire compute ecosystem. Your key responsibilities will include:
Operate at the highest level of technical expertise and influence. Implement solutions that prevent problems at a fundamental level across organizational boundaries.
Design, implement, and lead large-scale, cross-functional projects to improve the reliability, performance, and efficiency of our core services and infrastructure (10x impact).
Drive the reduction of toil by developing and deploying sophisticated automation tools and frameworks, championing the "everything as code" philosophy.
Serve as a technical escalation point for critical incidents, perform deep-dive root cause analyses (RCAs), and implement robust corrective measures to prevent recurrence.
Define and implement SLOs, SLIs, and error budgets for critical services. Enhance health-check, monitoring, logging, and tracing systems to provide comprehensive visibility into system performance.
Set the technical direction and best practices for the entire SRE and engineering organization. Mentor mid-level and senior engineers on design patterns, operational rigor, and reliability principles.
Lead as a technical expert with a proven track record of solving complex scaling and reliability challenges.
Required Qualifications
8+ years of progressive experience in Site Reliability Engineering, Production Engineering, or a closely related role.
Expert-level proficiency with AWS, including networking, compute, and storage.
Deep expertise in Kubernetes and the cloud-native ecosystem.
Fluency in at least one major scripting/programming language for automation and tooling (e.g., Python, Go, or Java).
Solid experience with monitoring and logging solutions such as Datadog.
Proven ability to design and implement robust, highly available distributed systems.
Demonstrated experience with Infrastructure as Code tools like Terraform.
Exceptional communication skills, capable of explaining complex technical issues to both technical and non-technical audiences.
Nice-to-Have
Experience implementing Service Mesh technologies (e.g., Istio, Linkerd).
Strong understanding of security principles and practices in a cloud environment.
Certifications such as CKA or CKAD.
Seniority Level:
Mid-Senior level
Employment Type:
Contract
Job Function:
Information Technology
Industry:
IT Services and IT Consulting