NY Staffing

Engineer, ML Infrastructure (Kubernetes/AWS) (Open to remote)

NY Staffing, New York, New York, United States, 10001


ML Infrastructure Engineer

Penguin Random House is looking for a skilled ML infrastructure engineer with expertise in AWS and Kubernetes to join our team. While Kubernetes will be a key focus, this role also requires working across our broader cloud systems, such as Databricks and Snowflake, to ensure infrastructure decisions support the full ML lifecycle. In this role you will work closely with our development, operations, and data science teams to ensure the reliability and scalability of our cloud infrastructure, and to ensure it is well-integrated with ML development patterns. This is a hands-on engineering role focused on creating the foundation that enables fast, compliant, and reliable ML delivery at scale.

Specific responsibilities include:
Designs, implements, and manages Kubernetes clusters on AWS using tools like Terraform; automates the deployment, scaling, and monitoring of containerized applications.
Automates deployment and scaling of ML containers and cloud-native services.
Supports infrastructure integration with platforms such as Databricks and Snowflake.
Ensures the security and compliance of cloud infrastructure and applications.
Monitors infrastructure performance and troubleshoots issues across cloud services and orchestration layers.
Stays up-to-date with the latest trends and technologies in cloud computing and container orchestration.
Please apply if you meet the following qualifications:
Proven experience with AWS cloud services, Kubernetes and Docker
Strong background in Linux systems engineering
Proficiency in programming languages such as Python, Java, or Go
Strong understanding of CI/CD tools such as GitLab CI
Knowledge of infrastructure as code tools like Terraform
Excellent problem-solving skills and attention to detail
Strong communication and collaboration skills
Holds a CKA or CKAD certification from Cloud Native Computing Foundation
Comfortable collaborating across DevOps, infrastructure, and ML teams

Preferred Qualifications:
Experience with monitoring tools like Prometheus, Grafana, Datadog, and Splunk
Prior exposure to platform reliability, observability, and operational scaling challenges
Experience with the deployment and scaling of ML models and workloads in production environments

The salary range for this position is $140,000 - $160,000. All positions are currently eligible for annual profit award or bonus, subject to company results.

Please apply by August 15, 2025 and include your resume and cover letter for consideration.

Penguin Random House values the array of talents and perspectives that a diverse workforce brings. All qualified applicants will receive consideration for employment without regard to race, national origin, religion, age, color, sex, sexual orientation, gender identity, disability, or protected veteran status.