AI Technologies LLC.
Cloud Infrastructure Site Reliability Engineer
AI Technologies LLC., Berkeley Heights, New Jersey, US, 07922
Overview
Cloud Infrastructure Site Reliability Engineer (SRE) with expertise in multiple public cloud platforms. Responsible for operating infrastructure solutions following Google’s SRE principles to meet uptime, reliability, and performance targets, and to drive automation and continuous improvement across production environments. Collaborates with cross-functional teams to enhance cloud reliability and streamline processes through automation.
Responsibilities
Operate and maintain cloud infrastructure solutions across multiple public cloud platforms (AWS, GCP, Azure), focusing on uptime, reliability, performance, and scalability.
Drive automation, monitoring, incident response, and post-incident reviews to continuously improve production systems.
Collaborate with product, platform, and security teams to implement changes safely and efficiently, with an emphasis on observability and reliability engineering practices.
Qualifications
Bachelor’s degree in Computer Science, Engineering, or related technical field, or equivalent practical experience.
3+ years of software development experience with proficiency in at least one programming language (e.g., Python, Go, Java, C++).
Experience administering cloud platforms (AWS, GCP, Azure), including networking, security, containerization, storage, data management, and serverless technologies.
Strong understanding of Linux systems, networking fundamentals, virtualization, distributed systems, file systems, and system processes.
Deep understanding of observability tools (monitoring, alerting, logging) in cloud environments, with ability to set up and maintain dashboards, alerts, and logs.
Familiarity with CI/CD tools for automated testing, deployments, provisioning, and observability.
Ability to manage and respond to incidents, perform root cause analysis, and conduct post-mortem reviews.
Understanding of how to set, monitor, and maintain Service-Level Objectives (SLOs) and Service-Level Agreements (SLAs) for system reliability.
Additional Qualifications
Experience in enterprise-scale financial services or regulated industries is a plus.
5+ years in SRE, DevOps, infrastructure, or cloud engineering roles, preferably with large-scale distributed systems.
Strong problem-solving, troubleshooting, and communication skills; ability to lead technical projects or mentor junior engineers.
Certifications: DevOps, SRE, CSRE, or related engineering certifications.
Equal Opportunity
AI TECHNOLOGIES LLC is an equal opportunity employer inclusive of female, minority, disability and veterans (M/F/D/V). Hiring, promotion, transfer, compensation, benefits, discipline, termination, and all other employment decisions are made without regard to race, color, religion, sex, sexual orientation, gender identity, age, disability, national origin, citizenship/immigration status, veteran status or any other protected status. AI TECHNOLOGIES LLC will comply with applicable laws related to equal opportunity and employment eligibility requirements. No posting or employment decision will require U.S. citizenship or lawful permanent residency except as necessary to comply with law, regulation, or government contracts.