Oracle
Senior Principal Software Engineer, AI Infra Compute
Posted 2 weeks ago
Job Description
Our team is the GPU Availability and Monitoring team in the Compute Org. We design and develop architectural changes for GPU delivery, health monitoring, triage automation, and diagnostic services. These services are essential for running distributed AI/ML/HPC workloads across thousands of GPUs, leveraging technologies such as RoCE and InfiniBand. We are looking for a highly skilled and motivated distributed systems engineer who can architect, scale, and optimize Monitoring and Repair solutions for AI infrastructure components, such as the GPU control plane and GPU data plane, that provide computing resources to customer AI workloads.
You will provide technical leadership to the team, bring clarity to ambiguous problems, and come up with innovative solutions. You will collaborate with cross‑functional teams to enhance our AI infrastructure, delivering an exceptional customer experience and peak performance.
Responsibilities
Architect solutions to scale and optimize Monitoring and Repair for components like GPU, CPU, Network, and Storage to improve customer experience and workload performance.
Develop “best‑in‑class” AI compute infrastructure by ensuring services and components are modular, secure, reliable, diagnosable, actively monitored, compliant, and reusable.
Collaborate with cross‑functional teams—development, operations, product management—to understand requirements and design solutions.
Optimize and improve the software development process to increase developer efficiency.
Mentor junior developers and promote modern engineering practices: data/telemetry‑driven decisions, well‑defined component interfaces, design reviews, coding standards, code reviews, unit and integration testing, and active production monitoring.
Develop benchmark metrics and automation to track performance and reliability across customer workloads, and correlate the results with the lower infrastructure stack.
Stay updated with industry trends, emerging technologies, and best practices in distributed systems and AI infrastructure management.
Qualifications & Skills
BS (or equivalent experience) in Computer Science, Engineering, or related field.
10+ years of software development experience in languages such as C, C++, C#, Java, Go, or Rust.
5+ years designing and developing large‑scale distributed systems, services, and infrastructure.
3+ years providing technical leadership and clarity to cross‑functional teams and projects.
Systematic problem‑solving approach, strong communication skills, sense of ownership, and drive.
Ability to adapt to a fast‑paced, dynamic environment and manage multiple tasks and priorities effectively.
Experience with Agile principles, data modeling, data warehousing, and data governance.
Experience with cloud infrastructure (OCI, AWS, Azure, GCP).
Operating system expertise: Linux, macOS.
Scripting languages: Bash, Perl, Ruby.
Familiarity with containerization technologies such as Docker.
API design and development experience—RESTful APIs, API gateways, API security.
Familiarity with API documentation tools such as Swagger/OpenAPI.
Experience with AI‑powered tools and platforms: chatbots, virtual assistants, predictive analytics.
Preferred Qualifications
Experience managing cloud infrastructure with hundreds of thousands of servers.
Experience with containerization technologies such as Docker and Kubernetes.
Experience scheduling high‑performance workloads on Kubernetes or Slurm.
US: Hiring range in USD: $96,800 to $251,600 per year. May be eligible for bonus, equity, and compensation deferral.
Career Level: IC5
Seniority level: Mid‑Senior level
Employment type: Full‑time
Job function: Engineering and Information Technology
Industries: IT Services and IT Consulting
Oracle is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans’ status, or any other characteristic protected by law. Oracle will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.