nTopology

Senior Software Engineer, AI Evaluation Infra

nTopology, New York, New York, us, 10261


Overview

nTop is pioneering the future of engineering design with advanced software that pushes the boundaries of performance and delivers mission-critical components faster than ever before. We focus on Aerospace & Defense, where programs face an impossible reality: deliver next-gen aircraft faster, with fewer experts, and with zero tolerance for failure. nTop changes how aircraft get designed. Our platform collapses months of configuration iteration into hours, letting teams explore thousands of validated variants instead of locking in the first concept. Teams cut development cycles by 50% and protect PWin with simulation-backed proposals. Defense primes and startups choose nTop when mission success isn't negotiable.

We are looking for Software Engineers to solve the hardest problems in physical design exploration. Our users are the world's most demanding builders of physical goods, from aircraft and race cars to energy turbines. Your focus will be on developing software for deeply parametric engineering, physical simulation, and managing immense design spaces. We reduce the crippling cost of late-stage design changes, making building with atoms as fast and agile as building with bits. If you're motivated by solving tough engineering challenges alongside a team that learns and grows together, you'll thrive at nTop. We're seeking teammates who are eager to experiment, innovate, and make a meaningful impact with technology.

nTop is hiring a Sr Software Engineer with a focus on Evaluation and Observability. You will own reliably measuring whether our AI systems are ready for production: designing, implementing, and maintaining the rigorous evaluation frameworks that ensure the accuracy, groundedness, and reliability of our system. This role is NYC-based (hybrid) and reports to the VP of Engineering.

What You'll Do

As our Sr Software Engineer in AI Evals Infra & Observability, you will be the quality gate for our AI systems, focusing on the entire data-to-answer pipeline. Your responsibilities will include:

Design evaluation frameworks: Develop metrics and benchmarks to systematically measure AI model performance, including accuracy, robustness, safety, and reliability.

Develop automated tools: Build automated evaluation pipelines that run tests at scale to assess AI performance under various conditions, including adversarial and edge-case scenarios, and/or integrate with third-party eval platforms and tools.

Implement human feedback loops: Design human annotation protocols and quality-control mechanisms to incorporate human judgment into the evaluation process, especially for subjective tasks.

Analyze model behavior: Conduct in-depth analysis to understand AI model performance, identify weaknesses, and pinpoint failure modes.

Build production systems: Extend or integrate external evaluation tools into production environments by creating dashboards, alerts, and observability tools to monitor models after deployment.

Golden Dataset Management: Collaborate with domain experts to curate and manage the high-quality "Golden Question-Answer-Context" datasets essential for ground-truth RAG evaluation (see the sketch after this list).

Prompt and System Optimization: Translate evaluation results into clear, actionable recommendations for engineers to optimize the LLM integration, prompt templates, and data chunking strategies.

Collaborate across teams: Work closely with product managers and software engineers to ensure that evaluation methodologies align with business goals and to communicate technical findings to stakeholders.
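To make the "Golden Question-Answer-Context" idea concrete, here is a minimal sketch of one golden record and a crude groundedness check. It is illustrative only: the field names, the word-overlap heuristic, and the sample data are assumptions, not nTop's actual schema or metrics.

    # Illustrative sketch: field names, sample data, and the overlap heuristic
    # are assumptions, not nTop's actual golden-dataset schema or metric.
    from dataclasses import dataclass

    @dataclass
    class GoldenRecord:
        question: str           # what the user asks
        reference_answer: str   # expert-approved ground-truth answer
        context: list[str]      # passages the answer must be grounded in

    def grounded(answer: str, context: list[str]) -> bool:
        """Crude groundedness proxy: every sentence of the answer must share
        vocabulary with at least one reference context passage."""
        sentences = [s.strip() for s in answer.split(".") if s.strip()]
        for sentence in sentences:
            words = set(sentence.lower().split())
            if not any(words & set(passage.lower().split()) for passage in context):
                return False
        return True

    record = GoldenRecord(
        question="Which lattice types are supported?",
        reference_answer="Gyroid and honeycomb lattices are supported",
        context=["The tool supports gyroid and honeycomb lattice structures."],
    )
    print(grounded(record.reference_answer, record.context))  # prints True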

Required Experience

We are looking for a hands-on engineer with 2-3 years of professional experience in machine learning, MLOps, or software quality assurance, specifically focused on modern LLM applications.

Experience building, testing, or evaluating production-grade RAG systems or other complex information retrieval/NLP systems.

Containerization & Infrastructure: Proven experience with Docker for containerizing applications, setting up consistent evaluation environments, and managing dependencies.

Programming & Tools: Expert proficiency in Python and experience with NLP/ML libraries and data processing tools.

MLOps and CI/CD: Practical experience integrating evaluation steps into automated testing and deployment pipelines for LLM-based applications (an illustrative gating step is sketched after this list).
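As one possible shape for such a gating step, the sketch below fails a build when exact-match accuracy on a golden set drops below a threshold. The evaluate_rag stub, the golden_set.json path, and the 0.90 bar are hypothetical placeholders, not details from the posting.

    # Illustrative sketch: a CI evaluation gate. evaluate_rag is stubbed, and the
    # file path and threshold are hypothetical placeholders.
    import json
    import sys

    def evaluate_rag(dataset_path: str) -> dict:
        """Score predictions against reference answers in a golden dataset.
        A real pipeline would first run the deployed system to produce them."""
        with open(dataset_path) as f:
            records = json.load(f)
        correct = sum(1 for r in records if r.get("predicted") == r.get("reference_answer"))
        return {"exact_match": correct / max(len(records), 1)}

    def main() -> int:
        scores = evaluate_rag("golden_set.json")
        print(f"exact_match = {scores['exact_match']:.3f}")
        # Non-zero exit fails the CI job if quality regressed below the bar.
        return 0 if scores["exact_match"] >= 0.90 else 1

    if __name__ == "__main__":
        sys.exit(main())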

Preferred Experience

These are "nice-to-have" experiences that would make a candidate an even stronger fit:

Domain Knowledge: Experience with AI/ML applications in CAD, simulation, engineering design, optimization, or manufacturing.

Search Relevance: Experience with classic information retrieval metrics, search engine optimization, or search relevance engineering.

Cloud Infrastructure: Experience deploying and scaling RAG components and evaluation pipelines using container orchestration tools like Kubernetes on cloud platforms (e.g., AWS, Azure, GCP).

LLM-as-a-Judge: Experience designing and validating LLM-based evaluation metrics for subjective quality assessment (a rough sketch follows this list).

Data Engineering: Familiarity with ETL processes specifically for unstructured document ingestion and metadata enrichment.
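For orientation only, here is roughly what an LLM-as-a-judge metric looks like: a rubric prompt plus a scoring wrapper. The rubric wording and the call_llm callable are assumptions; in practice the judge would call a real LLM API and be validated against human ratings.

    # Illustrative sketch: an LLM-as-a-judge faithfulness score. The rubric and
    # the call_llm interface are assumptions; the judge below is a stand-in.
    JUDGE_RUBRIC = """Rate the ANSWER for faithfulness to the CONTEXT on a 1-5 scale.
    5 = fully supported by the context; 1 = contradicts or invents facts.
    Reply with a single integer.
    CONTEXT: {context}
    QUESTION: {question}
    ANSWER: {answer}"""

    def judge_score(question: str, answer: str, context: str, call_llm) -> int:
        """call_llm is any callable that takes a prompt string and returns text."""
        prompt = JUDGE_RUBRIC.format(context=context, question=question, answer=answer)
        reply = call_llm(prompt).strip()
        return int(reply) if reply.isdigit() else 1  # unparsable replies count as failures

    # Usage with a stand-in judge, just to show the flow:
    print(judge_score("Q?", "A.", "C.", call_llm=lambda prompt: "4"))  # prints 4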

Benefits

Outstanding PTO and leave policy
ISO options
Healthcare: Medical, Dental, and Vision plans
401k with generous matching
Annual stipend for continued career learning and development
Commuter benefits for NY-based hires

Compensation

Depending on experience, the annual salary ranges from $145,000 to $190,000, plus options.
