DataRobot
DataRobot delivers AI that maximizes impact and minimizes business risk. Our platform and applications integrate into core business processes so teams can develop, deliver, and govern AI at scale. DataRobot empowers practitioners to deliver predictive and generative AI, and enables leaders to secure their AI assets. Organizations worldwide rely on DataRobot for AI that makes sense for their business — today and in the future.
AI Compute Team Overview
Our AI Compute team is the engine at the heart of DataRobot. We build and operate the foundational computing backbone that powers all of DataRobot's AI products and our customers' most demanding workloads. The team works backwards from the needs of data scientists, ML engineers, and application developers to provide the raw power and sophisticated orchestration required to run agentic AI at any scale. We are the internal equivalent of a hyperscale cloud provider's core compute service, obsessed with performance, efficiency, and enabling the future of AI.
Key Responsibilities
Develop, test, and support new compute primitives and features of DataRobot.
Create and maintain automated unit tests and functional tests.
Design infrastructure for new features with peer input.
Build a system that ensures microservices are secure, performant, and reliable, and can move from idea to production in hours.
Build a system that continuously recommends right-sized compute resources for Kubernetes workloads, keeping cloud spending efficient for ourselves and our customers (a sketch of this kind of recommendation logic follows this list).
Design and architect automated quality platforms to move from enterprise‑grade releases once a quarter to once per day or hour without sacrificing performance, security, or reliability.
Work with Product, Legal, and Security to ensure continuous delivery processes are compliant and secure.
Work with the team to ensure pipelines have clear playbooks and can operate 24/7 without manual intervention.
Collaborate with architects and platform engineers across R&D to set continuous delivery and performance requirements for all production services.
Collaborate with internal product managers to set roadmaps and milestones that deliver innovative, simple solutions to continuous delivery and platform engineering challenges across our many teams.
Manage individual projects and milestones with abundant communication of progress.
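For a flavor of the right-sizing responsibility above, here is a minimal, hypothetical sketch in Python of the kind of recommendation logic such a system might apply: take observed CPU usage for a workload and suggest a request that covers a high percentile of usage plus headroom. The function name, headroom factor, and sample data are illustrative assumptions, not DataRobot's actual implementation.

```python
# Hypothetical sketch of container right-sizing logic (not DataRobot's actual system).
# Given observed CPU usage samples (in cores), recommend a resource request that
# covers the 90th percentile of usage plus a safety headroom.
import math
import statistics


def recommend_cpu_request(usage_samples_cores: list[float], headroom: float = 1.2) -> float:
    """Return a suggested CPU request in cores, rounded up to 0.05-core granularity."""
    if not usage_samples_cores:
        raise ValueError("need at least one usage sample")
    p90 = statistics.quantiles(usage_samples_cores, n=10)[-1]  # 90th percentile cut point
    raw = p90 * headroom
    return math.ceil(raw * 20) / 20  # round up to the nearest 0.05 cores


# Example: usage samples gathered from metrics over a day (values are illustrative).
samples = [0.12, 0.15, 0.10, 0.30, 0.22, 0.18, 0.25, 0.40, 0.11, 0.16]
print(recommend_cpu_request(samples))  # -> 0.5 for these samples
```

In a real system the samples would come from cluster metrics (e.g., Prometheus) rather than a hard-coded list, and memory would be sized analogously.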
Knowledge, Skills, and Abilities
Expert proficiency in Kubernetes architecture and operations, including resource management, scheduling, autoscaling, Gateway API/Ingress, Prometheus, and OpenTelemetry, or experience with other orchestrators such as Nomad or Slurm (a small resource-audit example in Python follows this list).
Experience with GPU clusters and multi-node AI/ML workloads, whether as an engineer or an administrator.
Strong computer science fundamentals: object‑oriented design, data structures, algorithm design, problem solving, and complexity analysis.
Understanding of design for scalability, performance, and reliability.
Deep experience with automated testing and test‑driven development.
Demonstrable knowledge of software architecture for large systems.
Real‑world experience decoupling monolithic software into smaller reusable components.
Self‑motivated and proactive, able to take ownership and deliver results.
Ability and willingness to learn about new technologies.
Effective communication and stakeholder collaboration.
Operational excellence: continuously define and improve SLAs based on customer experience.
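To illustrate the Kubernetes resource-management skills in the list above, the sketch below uses the official Kubernetes Python client to flag containers running without CPU or memory requests. It is illustrative only, assumes a reachable cluster and a local kubeconfig, and is not part of the role description.

```python
# Illustrative only: a small audit script using the official Kubernetes Python
# client (pip install kubernetes) to flag containers that run without CPU or
# memory requests -- basic resource-management hygiene on a cluster.
from kubernetes import client, config


def find_unrequested_containers() -> list[str]:
    """Return 'namespace/pod/container' identifiers lacking CPU or memory requests."""
    config.load_kube_config()  # assumes a local kubeconfig; use load_incluster_config() in-cluster
    v1 = client.CoreV1Api()
    offenders = []
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        for container in pod.spec.containers:
            requests = container.resources.requests or {}
            if "cpu" not in requests or "memory" not in requests:
                offenders.append(f"{pod.metadata.namespace}/{pod.metadata.name}/{container.name}")
    return offenders


if __name__ == "__main__":
    for ref in find_unrequested_containers():
        print(ref)
```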
Requisite Education and Experience
5+ years of experience building software.
Expertise in developing a wide variety of software with Python (4+ years).
Experience designing and operating diverse CI/CD pipelines with Harness.io.
Experience designing and operating large-scale, horizontally and vertically scaled build, test, and deployment systems for Kubernetes environments, along with familiarity with Helm charts.
Preferred:
Golang, Terraform and Terragrunt.
Chronosphere.
Multi‑cloud experience (AWS, Azure, GCP, and OpenShift).
Nice to Have
Direct experience with modern distributed compute frameworks (e.g., Ray, Dask) and large-scale job schedulers (e.g., Slurm, Kueue); a short Ray example follows this list.
CKAD (Certified Kubernetes Application Developer) certification.
Publicly reviewable contributions to interesting development projects.
Agentic AI experience.
Experience with NVIDIA infrastructure, including managing Kubernetes operators such as the NVIDIA GPU Operator and the NVIDIA Dynamo Operator.
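As a pointer to the distributed compute frameworks mentioned in this list, here is a minimal Ray example of the fan-out/gather pattern those frameworks provide. It is purely illustrative and makes no claims about DataRobot's workloads.

```python
# Minimal Ray example (pip install "ray[default]") showing the distributed
# compute pattern referenced above: fan a function out across workers and
# gather the results.
import ray

ray.init()  # connects to an existing cluster if RAY_ADDRESS is set, otherwise starts a local one


@ray.remote
def square(x: int) -> int:
    return x * x


futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```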
Benefits
Medical, Dental & Vision Insurance; Flexible Time Off Program; Paid Holidays; Paid Parental Leave; Global Employee Assistance Program; and more.
Equal Employment Opportunity
DataRobot is a proud Equal Employment Opportunity and Affirmative Action employer. We do not discriminate based on race, religion, color, national origin, gender (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender identity, gender expression, age, veteran status, or disability. We are committed to providing reasonable accommodations to applicants with physical and mental disabilities. Please see the United States Department of Labor’s EEO poster for additional information.