Oracle
OCI (Oracle Cloud Infrastructure) AI Infrastructure is at the forefront of building a cutting‑edge, ultra‑high‑performance GPU platform designed to support AI/ML/HPC workloads. This role is part of the GPU Availability and Monitoring team in the Compute Org, responsible for designing and developing architectural changes for GPU delivery, health monitoring, triage automation, and diagnostic services. You will work on distributed AI/ML/HPC workloads across thousands of GPUs, leveraging technologies like RoCE and InfiniBand.
Responsibilities
Work independently in ambiguous situations to ensure adherence to published standards and practices.
Design, develop, troubleshoot, and debug software programs for various cloud infrastructure components, including databases, applications, tools, and networks.
Take an active role in defining and evolving standard practices and procedures for software engineering, with a focus on AI‑driven development.
Design, develop, and debug software applications and operating system components, leveraging AI and ML techniques.
Lead the development of critical initiatives, including:
Design and implement ML‑based spike detection mechanisms for provisioning failures to minimize operational disruptions.
Expand integrations with Kafka to enable near real‑time actions supporting 1‑Day SLOs for hardware repairs, utilizing event‑driven architecture and stream processing.
Develop an automated ticket routing framework to streamline workflows, enhance efficiency, and reduce operational overhead, powered by NLP and ML.
Accelerate dedicated initiatives through collaborative efforts with cross‑functional teams and customers, applying AI‑driven insights and recommendations.
Harness the power of AI and ML to create innovative tools and frameworks that automate testing, simulate complex environments, and reproduce incidents, freeing up human ingenuity to focus on higher‑value tasks.
Collaborate and lead technical discussions across multiple teams to ensure seamless integrations and effective problem‑solving.
Provide direction and mentoring to junior engineers, sharing knowledge and expertise to promote growth and development.
Qualifications
Experience with Python, Java, or TypeScript.
Hands‑on experience in AI/ML, especially leveraging ML for operational monitoring and automation.
Strong background in Linux system programming and kernel‑level development.
Experience with Docker and container orchestration.
Knowledge of RESTful API design and API security.
Experience with cloud platforms (OCI, AWS, Azure, GCP) and familiarity with cloud infrastructure concepts.
Ability to work independently and collaborate across teams.
Excellent problem‑solving and debugging skills.
Technical Skills
Programming languages: Python, Java, TypeScript
Development methodologies: Agile Principles
Data management: data modeling, data warehousing, data governance
Cloud infrastructure: OCI, AWS, Azure, GCP
Operating systems: Linux, macOS
Scripting languages: Bash, Perl, Ruby
Containerization: Docker
API design: RESTful APIs, API gateways, API security
AI tools: chatbots, virtual assistants, predictive analytics
Database: MySQL, caching technologies (Redis, MemoryCache)
Systems architecture: data synchronization, fault tolerance, state management
Networking: general enterprise storage, networking, and computing experience
Benefits
Medical, dental, and vision insurance
Short‑term and long‑term disability coverage
Life insurance and AD&D
Supplemental life insurance
Health care and dependent care Flexible Spending Accounts
Pre‑tax commuter and parking benefits
401(k) savings and investment plan with company match
Paid time off, holidays, and paid sick leave
Paid parental leave and adoption assistance
Employee Stock Purchase Plan
Voluntary benefits: auto, homeowner, pet insurance
Hiring Information
US: Hiring range $79,200 – $178,100 per year, plus potential bonus and equity.
About Oracle
Oracle is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability, or protected veteran status.