Fastly
Overview
We’re seeking a versatile and experienced Site Reliability Engineer who thrives in a fast-paced, high-scale environment and is passionate about reliability, performance, automation, and tooling. Reporting to the VP of Performance Center Operations, you’ll serve as a key individual contributor within the Performance Center Operations team.
The Fastly Performance Center is the strategic and operational engine that ensures the highest level of performance for the most demanding workloads on the Internet. We proactively safeguard quality of service at global scale, drive technical and product strategies that shape our platform’s evolution, and directly influence revenue outcomes by ensuring our customers succeed.
The scope of this role will evolve with the needs of the business and the maturity of the program. Additional responsibilities may be assigned based on individual expertise and strategic priorities from the Office of the Founder & CTO.
What You'll Do
This role is approximately 35% Site Reliability Engineering, 35% Data Analysis / Traffic Insights, and 30% Cross-functional Operations, balancing technical expertise with collaboration and strategic impact.
- Drive the development of automation and observability tooling that improves operational efficiency and platform reliability, including traffic monitoring, alerting, and surveillance tools.
- Partner with observability teams to implement and improve existing dashboards (Grafana, Prometheus) and metrics pipelines that provide meaningful visibility into traffic patterns, surges, and seasonal trends.
- Help define SLIs/SLOs and improve monitoring frameworks, ensuring alerts and dashboards reflect operational reality and proactively surface issues before customer impact.
- Collaborate with data/analytics teams to leverage data pipelines (e.g., SQL, BigQuery, or other large-scale data stores) for trend analysis, capacity planning, and traffic pattern recognition.
- Step in to run daily operational standups or coordination meetings as needed, ensuring priorities are clear, follow-ups are tracked, and cross-functional execution maintains momentum.
- Facilitate cross-team communication during high-impact initiatives or incident reviews, surfacing blockers early and maintaining execution momentum.
- Perform root-cause investigations of performance, scalability, or traffic anomalies, and translate learnings into improvements in tooling and architecture.
- Act as a technical liaison, helping contextualize traffic behavior and system performance, and supporting escalations with clear insight.
- Help define and evolve runbooks, incident response processes, post-mortems, and the knowledge base, ensuring that repeated issues are proactively surfaced and addressed via automation or tooling rather than reactive firefighting.
- Provide leadership in incident response, mitigation, and communication across teams.
- Monitor seasonal patterns, major events, and global traffic distribution, helping ensure the platform remains resilient during shifts in demand.
What We're Looking For
- 8+ years of experience in Site Reliability Engineering, Systems Engineering, Platform/Infrastructure Engineering, or equivalent roles.
- Professional experience operating in CDN, streaming media, or other high-volume internet traffic environments.
- Deep understanding of network, distributed, and cloud systems: TCP/IP, DNS, HTTP/S, TLS, and caching/proxy/CDN technologies; direct experience in CDN, Web Application, and API Security is a plus.
- Demonstrated ability to build automation, tooling, and observability systems (e.g., dashboards, alerts, instrumentation, data pipelines). Experience with Prometheus, Grafana, BigQuery/SQL, etc.
- Hands-on experience with scripting or programming (e.g., Python, Go, Shell) and comfort building tooling rather than just consuming it.
- Experience working cross-functionally with engineering, infrastructure, operations, analytics, and customer/account teams.
- Strong communication skills and the ability to translate technical findings for non-technical stakeholders.
- Demonstrated ability to coordinate complex technical work across multiple teams, facilitate daily standups or working sessions, and maintain operational momentum in complex, fast-moving environments.
- Proven track record of driving mission-critical reliability and performance improvements in production systems.
- Strong sense of ownership and accountability.
- Experience with monitoring/alerting systems and incident response.
- Bonus for experience with live streaming, high-variability traffic, or global seasonality at scale.
Preferred Certifications and Experience
We’ll be super impressed if you have experience in any of these:
- Experience with large-scale data analytics systems (BigQuery, Spark, Presto) to derive operational insights from traffic telemetry.
- Familiarity with cloud platforms (AWS, GCP, Azure), infrastructure as code, or container orchestration (Terraform, Kubernetes).
- Experience evaluating build-vs-buy decisions and driving platform-wide tooling improvements.
- Background in media, live events, or streaming operations in a high-throughput, latency-sensitive environment.
Work Hours
This position will require you to be available during core business hours, with occasional nights and weekends as required for on-call and incident response.
Work Location(s) & Travel Requirements
The preferred locations for this position are:
- San Francisco, CA
- New York, NY
- Denver, CO
Fastly currently embraces a largely hybrid model for most roles, which allows employees the flexibility to split their time between the office and home. There is a strong preference for hybrid candidates located near a local office. However, we may be willing to consider exceptionally qualified remote candidates within the US.
This position requires travel as needed for your role or as requested by your manager.
EEO and Benefits
Fastly is committed to equal employment opportunity and providing employees with a safe and welcoming work environment free of discrimination and harassment. We offer a comprehensive benefits package including medical, dental, and vision insurance, 401(k) with company match, Employee Stock Purchase Program, flexible vacation, paid sick leave, and holiday and wellness programs. For 2025, we offer 11 paid local holidays and 11 paid company wellness days.