Energy Jobline ZR

Principal Site Reliability Engineer in Miami

Energy Jobline ZR, Miami, Florida, US, 33222


Energy Jobline is the largest and fastest-growing global energy job board and energy hub. We reach an audience of over 7 million energy professionals, advertise 400,000+ global energy and engineering jobs each month, and work with the leading energy companies worldwide.

We focus on the Oil & Gas, Renewables, Engineering, Power, and Nuclear markets as well as emerging technologies in EV, Battery, and Fusion. We are committed to ensuring that we offer the most exciting career opportunities from around the world for our jobseekers.

Job Description

About Iru

Iru is the AI-powered security & IT platform used by the world’s fastest-growing companies to secure their users, apps, and devices. Built for the AI era, Iru unifies identity & access, endpoint security & management, and compliance automation, collapsing the stack and giving IT & security time and control back.

Iru is backed by some of the smartest investors in tech: General Catalyst, Tiger Global, Felicis, Greycroft, and First Round Capital. In July 2024, Iru raised $100 million from General Catalyst, valuing the company at $850 million. Customers include Notion, Cursor, Lovable, Replit, and Mercor, and Iru partners with industry leaders such as ServiceNow and AWS. Iru was named to Forbes’ America’s Best Startup Employers 2025 list for employee engagement and satisfaction.

The Opportunity

As a Principal Site Reliability Engineer at Kandji, you will play a critical role in ensuring the reliability, scalability, and performance of our platform. In this strategic position, you’ll work cross-functionally to build and evolve the systems, tools, and processes that keep our services resilient and performant, especially as we scale to meet the demands of a growing customer base.

You’ll bring a deep understanding of distributed systems, incident management, observability, and automation. Your experience with AWS, Kubernetes, and Infrastructure as Code (Terraform) will help drive efforts to proactively identify and eliminate reliability risks, reduce toil through automation, and establish engineering best practices across teams.

We’re looking for a seasoned engineer with both technical depth and a strategic mindset: someone who can guide long-term reliability efforts, lead postmortems and systemic remediation, and mentor others in SRE principles. This role provides the opportunity to shape the culture and architecture of reliability at Kandji, partnering closely with engineering, infrastructure, and product teams to build systems that are not only functional, but also fault-tolerant and maintainable.

How You Will Make a Difference Day to Day:

Reliability Strategy & Resilience Engineering: Design and implement fault-tolerant, scalable, and highly available systems across our AWS-hosted platform to ensure reliability under load and failure conditions.

Service Ownership & Runbook Maturity: Partner with engineering teams to define and uphold SLIs/SLOs, perform root cause analyses, and drive post-incident reviews with a focus on long-term systemic improvements. Run recurring reliability reviews and mature incident response practices, including alert quality, runbooks, and failure simulations.

Automation & Tooling: Build and maintain automation for deployment, incident response, and remediation workflows to reduce manual toil and increase operational efficiency.

Secure Systems Design: Implement DevSecOps practices hands-on, including secure IaC, policy-as-code, and controls embedded in pipelines or platform abstractions.

Observability & Monitoring: Champion the development of comprehensive observability solutions, including metrics, logging, tracing, and alerting, to enable proactive detection and resolution of issues.

Infrastructure as Code: Contribute to and improve our Terraform-based infrastructure management, enabling consistent, auditable, and repeatable infrastructure deployments.

Capacity Planning, FinOps & Performance: Lead efforts in system tuning, load testing, and capacity forecasting to support our scaling platform and avoid bottlenecks before they occur. Lead efforts to monitor and optimize cloud costs across environments. Design and advocate for architectural trade-offs that balance cost, performance, and reliability.

Cross-Functional Reliability Coaching: Embed reliability thinking into engineering and product workflows. Run architecture reviews, failure simulations, and training to elevate operational discipline.

Mentorship & Leadership: Mentor engineers across the organization in SRE best practices, incident response, and reliability design patterns, helping build a culture of ownership and operational excellence across the company.

We’d love to hear from you if you have:

Experience: 10+ years in Site Reliability Engineering, DevOps, Infrastructure, or related roles, with a proven track record of improving system reliability and scaling distributed systems in cloud environments (preferably AWS).

Technical Proficiency: Deep expertise in Infrastructure as Code (Terraform strongly preferred), Kubernetes, and container orchestration at scale; strong background in automation, scripting (e.g., Python, Go, or Bash), and CI/CD pipelines.

Reliability Engineering Mindset: Experience defining and maintaining SLOs/SLIs, leading incident response and postmortems, and applying SRE principles to reduce toil and improve system reliability. Deep familiarity with chaos engineering, failure mode analysis, and designing systems for graceful degradation under partial failure.

Observability & Performance: Strong understanding of modern observability stacks (e.g., Datadog, Prometheus, Grafana, OpenTelemetry) and performance tuning for distributed systems.

Security & Compliance Awareness: Solid understanding of security and compliance in cloud environments, with experience implementing secure-by-default infrastructure patterns. Familiar with secure infrastructure design, cloud compliance requirements (SOC 2, ISO 27001, ISO 42001), and embedding DevSecOps into delivery workflows.

Problem Solving: Skilled in diagnosing complex, multi-layered production issues and implementing pragmatic, long-term solutions.

Influence & Communication: Excellent written and verbal communication skills with the ability to clearly articulate reliability trade-offs and influence engineering teams toward better operational outcomes. Trusted collaborator with product, infra, security, and GTM leaders.

Location: Required to work on-site five days a week in our Miami office (Coral Gables).

Benefits & Perks

Competitive salary
100% individual and dependent medical + dental + vision coverage
401(k) with a 4% company match
20 days PTO
Flexibility to work from anywhere for up to 30 days per year
Iru Wellness Week the first week in July
Equity for full-time employees
Lunch stipend provided Monday through Friday
Up to 16 weeks of paid leave for new parents
Paid Family and Medical Leave
Modern Health mental health benefits for individuals and dependents
Fertility benefits
Working Advantage employee discounts
Onsite fitness center
Free parking
Exciting opportunities for career growth

Iru is proud to be an equal opportunity employer. Qualified applicants will be considered for employment without regard to physical or mental disability, protected veteran or military status, or any other status protected by applicable law.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.

If you are interested in applying for this job, please press the Apply button and follow the application process. Energy Jobline wishes you the very best of luck in your next career move.
