Qtsolv
Responsibilities:
You will lead all aspects of DevOps, CI/CD, observability, alerting, and reliability, working to create and maintain a reliable, secure, scalable, and highly resilient data streaming platform.
You will lead team-level DevOps and SRE outcomes and initiatives.
Configure and improve cloud infrastructure for service availability, resiliency, performance, and cost efficiency as load increases over time.
Be accountable for the SLOs of the services by driving and improving processes, including service reviews, fire drills, and HA assessments.
Create innovative solutions to monitor the health of data streaming apps.
Keep systems updated in a timely manner to maintain security compliance.
Engage in technical discussions and technical decision-making.
Build tools to improve operational efficiency.
Serve as the primary point of contact responsible for the overall health, performance, and capacity of our data streaming platform across Autodesk.
Drive the design, implementation, and management of an expanding observability infrastructure, keeping up to date with new technologies.
Come up with innovative solutions for a resilient data streaming platform at scale.
You will collaborate with software architects, product managers, and software developers to iteratively transform high-level SRE and DevOps requirements into enhancements that are delivered incrementally.
Lead sustainable incident response, blameless postmortems, and production improvements that result in direct business opportunities.
Automate deployment, scaling, and management of infrastructure using modern DevOps tools and practices.
Monitor and optimize system performance, troubleshoot issues, and implement solutions.
Implement and maintain configuration management and infrastructure as code (IaC) using Terraform.
Define and document best practices across all pillars of DevOps/SRE.
Minimum Qualifications:
BS or MS in Computer Science or a related technical field, or relevant experience.
5+ years of software engineering experience, with proven experience in DevOps and SRE roles accountable for SLOs.
Hands-on, senior-level experience developing and deploying on AWS (Amazon Web Services), specifically S3, Lambda, SQS/SNS, and databases (Aurora, DynamoDB).
Understanding of and curiosity about SRE best practices, architectures, and methods.
Experience in continuous delivery and deployment with Terraform.
Strong experience in Java, Python, Groovy, and other programming languages.
Good knowledge of resiliency patterns and cloud security.
Proficiency in using observability tools such as Grafana, Splunk, Dynatrace, Datadog, OpenTelemetry, or Prometheus.
Experience with security compliance, such as SOC 2.
Hands-on experience with data streaming, transformation, and ETL technologies.
Understanding of Apache Flink, Kinesis, Kafka, and Kubernetes.
Demonstrated experience managing Kubernetes, Mesos, or similar clusters.
Proven capability to lead incident response, drive root cause analysis, and implement preventive measures.
Expertise in DevOps/SRE practices, including IaC, configuration management, container technologies, microservices, and CI/CD processes.
Strong problem-solving skills and the capability to work on complex systems.
Experience working in an Agile environment.
Experience working with a distributed team.
Preferred Qualifications:
Passion for running and improving customer-facing systems with a high degree of availability (four 9s).
Experience with databases and database design principles at cloud scale.
Excellent verbal and written communication skills, with experience collaborating in a dispersed, multicultural team to deliver projects and, at times, leading initiatives.
A perpetual learner who often ideates about new and improved ways of doing things and is confident sharing ideas with the rest of the engineering team.
Demonstrates mature judgment when making engineering decisions and can reliably make the call between elegant and practical solutions.