Vinsys Information Technology Inc

Foundations Site Reliability Engineer

Vinsys Information Technology Inc, Seattle, Washington, US, 98127

Overview

SRE | Foundations | Site Reliability Engineer (Contract)

About This Team

Site Reliability Engineering

We are looking for a motivated engineer to join the Foundations team, which is responsible for observability and monitoring within Site Reliability Engineering (SRE) and guides the digital organization in improving its reliability practices. We are a consultative enablement team that provides guidance and support to product engineering teams, helping them develop high-quality, resilient software systems through monitoring tools and practices. SRE partners with many product engineering teams across digital and beyond to infuse the concepts and practices of reliability into engineering processes and deliverables. The Foundations team owns the management of our monitoring tools and the best practices for using those tools to provide total visibility into our systems. This role requires a vision and strategy for monitoring and for managing it across a disparate organization.

Responsibilities

As an Engineer II on the SRE Foundations team, you are a technical contributor and domain leader in observability and reliability. Your day-to-day responsibilities include designing, implementing, and maintaining robust monitoring solutions; creating insightful dashboards; identifying relevant metrics; and driving efficient problem management practices.

Help identify observability maturity opportunities and roadblocks to success for digital teams, and clear those roadblocks.

Partner closely with Product Owners and Scrum Masters to manage scope and to balance support and investment work.

Clearly communicate risks to deliverables to partners.

Qualifications

Observability & Monitoring: Design, implement, and optimize observability solutions across metrics, logging, and tracing. Build and maintain dashboards and alerts (e.g., Datadog) that provide meaningful insight into system health and performance. Define and support adoption of SLOs, SLIs, and error budgets.

Incident & Problem Management: Participate in and lead incident response during major outages and critical events. Support on-call rotations during key events. Conduct and contribute to RCAs and post-incident reviews, driving follow-up actions and remediation plans. Collaborate to enhance incident playbooks, reduce MTTD/MTTR, and improve operational readiness. Apply ITIL principles for incident, problem, and change management.

Team Collaboration & Enablement: Partner with digital product teams to integrate observability best practices into development/deployment workflows. Identify tooling and knowledge gaps; champion improvements and automation to reduce toil and increase visibility. Support prioritization between support, investment, and innovation; mentor junior team members.

Continuous Improvement & Strategic Contribution: Stay up to date with SRE and observability trends; evaluate and adopt new tools. Contribute to domain-level standards and practices. Influence reliability strategy by sharing insights with senior engineers and leadership.

Requirements

Bachelor’s degree in Computer Science, Engineering, or equivalent experience.

8–12 years of software engineering or SRE experience with strong exposure to observability.

Strong experience with Datadog, Splunk, and distributed tracing.

Proven incident management, RCA facilitation, and on-call experience, especially during peak traffic.

Understanding of ITIL concepts.

Experience building dashboards, alerts, and SLOs/SLIs.

Strong debugging/root cause analysis skills.

Excellent collaboration and communication.

Familiarity with Terraform, Kubernetes, and cloud-native systems.

Certifications such as CKA or Terraform Associate are a plus.

Bonus

Deep expertise in observability tooling (Datadog, Splunk).

Experience in e-commerce or high-availability digital platforms.

Background in product ownership or leading reliability-focused initiatives.

Must haves

Acknowledges the presence of choice in every moment and takes personal responsibility for their life.

Required Skills

Terraform, Kubernetes, Splunk

Additional Details

Background Check: No

Drug Screen: No

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Engineering and Information Technology

Industries

Software Development
