Metropolis Technologies

Staff Software Engineer, Reliability

Metropolis Technologies, New York, New York, US, 10261

Metropolis is an artificial intelligence company that uses computer vision technology to enable frictionless, checkout‑free experiences in the real world. We are reimagining parking to enable millions of consumers to just “drive in and drive out.” Our vision is to allow people to transact with speed, ease, and convenience unmatched online, and to power checkout‑free experiences everywhere. The goal is to return time to people and make everyday experiences of living, working and playing remarkable.

Who you are

We are building a hyperscaler company and need someone to own reliability across the entire Metropolis platform. As a Staff or Senior Software Engineer focused on Reliability, you’ll establish and drive comprehensive reliability practices that ensure system availability, resilience, and observability for our mission‑critical mobility infrastructure serving millions of transactions. This is your opportunity to build reliability from first principles: architect fail‑over systems, implement chaos‑engineering practices, and improve the observability foundation that will enable scaling while maintaining 99.9%+ uptime.

You’ll be the technical owner of our reliability posture, working on everything from multi‑region fail‑over architectures to incident‑response workflows to SLO‑based alerting strategies.

Our platform handles real‑time payment processing, customer authentication, and parking facility operations – systems that cannot go down. You’ll tackle challenges such as external service fail‑over, dependency mirroring to prevent upstream outages, database replication, automatic promotion, and building monitoring and alerting infrastructure that detects and responds to issues in minutes, not hours.

If you're energized by the challenge of ensuring system reliability at scale, building robust fail‑over mechanisms, implementing comprehensive observability, and establishing practices that prevent incidents before they occur, this role is for you. You'll work alongside highly technical teams, influencing architecture decisions and setting reliability standards that affect every service we build.

What you’ll do

Own the overall reliability posture for the Metropolis platform, establishing practices, metrics, and systems that ensure 99.9%+ uptime across all services

Design and implement automatic fail‑over mechanisms for critical external dependencies with circuit breakers, retry policies, and degraded‑mode operations

Architect and build active‑passive or active‑active regional deployment strategies with database replication, automated fail‑over, and DNS‑based traffic routing including disaster‑recovery planning and testing

Establish comprehensive monitoring using Datadog for APM, logs, and metrics correlation; implement synthetic monitoring, SLO‑based alerting, on‑call rotations, and escalation policies; and build service‑health dashboards that show customer impact

Own the incident‑management process including workflows, tooling, post‑mortem culture, and runbook automation, driving down mean time to recovery (MTTR) from detection to resolution

Drive adoption of resilience patterns across all services including health checks, graceful degradation, feature flags, rate limiting, backpressure mechanisms, and chaos‑engineering practices

Build and maintain local mirrors for critical dependencies with artifact caching, dependency pinning, and vulnerability scanning to prevent build failures from upstream outages

What we’re looking for

8+ years of backend software engineering experience with a deep focus on distributed systems and platform infrastructure

Expert‑level Java proficiency with deep understanding of JVM performance, concurrency, and ecosystem tooling; Scala experience is a big plus

Production experience with microservices architecture, container orchestration (Kubernetes), and cloud platforms (AWS)

Strong systems thinking with proven ability to design and implement large‑scale, high‑availability distributed systems that handle significant load

Observability expertise with hands‑on production experience with metrics, logging, tracing, and alerting systems in high‑load environments

Database and data systems knowledge including relational databases, event streaming, caching strategies, and data consistency patterns

Experience with AI‑powered development tools for enhanced productivity – context engineering in particular

Excellent technical communication with ability to design and document complex systems, lead technical discussions, and collaborate across multiple teams

Local to the New York City, Seattle, or Los Angeles area

While not required, these are a plus

SRE or Reliability Engineering experience at companies known for operational excellence or high‑growth startups where you built reliability practices from the ground up

Incident response leadership including experience building incident‑management processes, conducting blameless post‑mortems, and driving MTTR reduction initiatives in production environments

Chaos engineering experience with tools such as Chaos Monkey, Gremlin, or similar, including designing and executing game days and failure‑injection testing

Performance optimization experience with profiling, benchmarking, capacity planning, and system tuning at hyperscale

Open source contributions or technical blog writing that demonstrates depth of expertise in reliability engineering, distributed systems, or production operations

Our Stack

Languages + Frameworks: TypeScript, React, Scala (principally), Java (limited)

Datastores: MySQL, PostgreSQL, Snowflake

Cloud: AWS

Version control: Git & GitHub

AI Tooling: Copilot on GitHub

Observability: Datadog

When you join Metropolis, you’ll join a team of world‑class product leaders and engineers, building an ecosystem of technologies at the intersection of parking, mobility, and real‑estate. Our goal is to build an inclusive culture where everyone has a voice and the best idea wins. You will play a key role in building and maintaining this culture as our organization grows.

Affiliate salary: $180,000.00 USD to $200,000.00 USD annually. Base salary is one component of Metropolis’s total compensation package, which may also include access to or eligibility for healthcare benefits, a 401(k) plan, short‑term and long‑term disability coverage, basic life insurance, a lucrative stock option plan, bonus plans and more.

Metropolis values in‑person collaboration to drive innovation, strengthen culture, and enhance the Member experience. Our corporate team members hold to an office‑first model, requiring employees to be on‑site at least four days a week, fostering organic interactions that spark creativity and connection.

Metropolis may utilize an automated employment decision tool (AEDT) to assess or evaluate your candidacy for employment or promotion. AEDTs are used to assist in assessing a candidate’s application relative to the required job qualifications and responsibilities listed in the job posting.

As part of this process, Metropolis retains data relevant to your candidacy, including personal information, for a period that is reasonably necessary for the use of the tool. If you are hired for the position, your data may become part of your employee records.

Metropolis Technologies is an equal opportunity employer. We make all hiring decisions based on merit, qualifications, and business needs, without regard to race, color, religion, sex (including gender identity, sexual orientation, or pregnancy), national origin, disability, veteran status, or any other protected characteristic under federal, state, or local law.

Seniority Level: Mid‑Senior level

Employment Type: Full‑time

Job Function: Engineering and Information Technology

Industries: Consumer Services and Facilities Services
