Disney Cruise Line - The Walt Disney Company
Sr Site Reliability Engineer
Disney Cruise Line - The Walt Disney Company, Celebration, Florida, United States
Overview
“We Power the Magic!” That’s our motto at Disney Experiences Tech & Digital (DXT). Our team creates world-class immersive digital experiences for Disney’s premier vacation brands, including Disney’s Parks & Resorts worldwide, Disney Cruise Line, Aulani, a Disney Resort & Spa, and Disney Vacation Club. We are responsible for the end-to-end digital and physical Guest experience for all technology and digital-led initiatives across the Attractions & Entertainment, Food & Beverage, Resorts & Transportation, and Merchandise lines of business, as well as other initiatives including MyDisneyExperience and Hey, Disney!
This role sits in the DSE Technology Operations organization within Technology & Digital for Disney Experiences and works closely with Applications Teams from across the company. The Sr. Site Reliability Engineer will report to the Manager, Technology Operations.
In this role, you will coordinate and manage retrospective discussions and continued troubleshooting in support of operational systems. You will work closely with infrastructure and application teams to troubleshoot, determine root cause, and recommend long-term fixes and interim mitigation steps, with an eye toward increased availability and reduced time to recover in the event of a systems failure. You will also be deeply involved in designing and refreshing our lower environment strategy to better support release and deployment activities.
The DSE Technology Operations team provides operational support for the production systems used by our guests, cast, and crew for Disney Cruise Line, Disney Vacation Club, and all DSE emerging businesses.
What You’ll Do
Drive a DevOps culture among peers and developers
Design, build, and support product platforms
Consult, design, build, and support development pipelines; automate infrastructure and operations; create telemetry for monitoring; engineer high reliability; and reinforce best practices to secure company data
Perform systems administration on Linux, Windows, and Kubernetes, including on AWS, Google Cloud, and Azure, drawing on extensive experience with web technologies, source control management using Git, and AWX and Ansible
Provide systems administration knowledge across Windows, Linux, and Kubernetes platforms, including knowledge of systems, network, operational excellence, application stability, security, performance, capacity management, and documentation
Stay up to date with emerging technologies
Collaborate with cross-functional teams to ensure timely and comprehensive resolution of system issues
Design and implement robust monitoring and tracking solutions for Windows, Linux, and containerized systems, leveraging existing investments in productivity and related tools
Coordinate and organize retrospective discussions following major incident outages or key system challenges; review troubleshooting and resolution against best practices
Apply SDLC, ITIL, and other industry-wide best practices in incident and problem management to increase system availability while decreasing time to resolve and return to service
Provide expert-level support in troubleshooting and resolving application-related incidents when needed
Bring knowledge of systems administration on both Windows and Linux platforms, as well as systems, network, operational excellence, application stability, security, performance, and capacity management, to application and infrastructure teams
Lead technical projects and ensure smooth delivery
Collaborate with Security Operations teams for secure solutions
Apply strong troubleshooting skills across systems, network, and code
Take a proactive approach to continuous learning and skill development, with an interest in mastering emerging data engineering tools and methodologies
Own lower-environment design, build, and management; develop automation and monitoring for both lower and production environments
Coordinate and automate deployment of new software builds across multiple lower environments
Required Qualifications & Skills
Minimum 5 years of related work experience
Proficient in agile environments
Applied understanding of observability principles using relevant tools
Hands-on experience with CI tools like GitLab, Ansible, and Azure DevOps
Proficient in configuration management tools: Terraform, Ansible, Chef
Experience in procedural programming languages (Python, Perl, Ruby, Java, Go, Rust, C/C++, PowerShell)
Skilled in cloud environments (AWS, Azure, Google Cloud)
Collaborative in building reliable, scalable enterprise systems
Capable of identifying root causes in large-scale distributed systems
Proficient in UNIX/Linux, Windows, and Kubernetes administration, troubleshooting, and security
Experience leading technical projects and ensuring smooth delivery
Experience working collaboratively with Security Operations teams on secure solutions
Strong troubleshooting skills across systems, network, and code
Proactive and continuous learning mindset with an interest in emerging data engineering tools and methodologies
Required Education
Bachelor’s degree in Computer Science, Information Systems, Software, Electrical or Electronics Engineering, or comparable field of study, and/or equivalent work experience