Senior Software Engineer, AI Evaluation Infra

nTop, New York, NY 10261, US

nTop is pioneering the future of engineering design with advanced software that pushes the boundaries of performance and delivers mission‑critical components faster than ever before. With a focus on Aerospace & Defense, we help programs deliver next‑gen aircraft faster, with fewer experts and zero tolerance for failure. Our platform collapses months of configuration iteration into hours, letting teams explore thousands of validated variants instead of locking in the first concept. Teams cut development cycles by 50% and protect PWin (probability of win) with simulation‑backed proposals.

We are looking for Software Engineers to solve the hardest problems in physical design exploration. Our users are the world’s most demanding builders of physical goods—from aircraft and race cars to energy turbines. Your focus will be on developing software for deeply parametric engineering, physical simulation, and managing immense design spaces. If you’re motivated by solving tough engineering challenges alongside a team that learns and grows together, you’ll thrive at nTop.

nTop is hiring a Senior Software Engineer focused on Evaluation and Observability. You will own reliably measuring whether our AI systems are ready for production: designing, implementing, and maintaining rigorous evaluation frameworks that ensure the accuracy, groundedness, and reliability of those systems. This role is NYC‑based hybrid and reports to the VP of Engineering.

What You’ll Do

Design evaluation frameworks: Develop metrics and benchmarks to systematically measure AI model performance, including accuracy, robustness, safety, and reliability.

Develop automated tools: Build automated evaluation pipelines that run tests at scale to assess AI performance under various conditions, including adversarial and edge‑case scenarios, and integrate with third‑party evaluation platforms or tools; a minimal harness in this style is sketched after this list.

Implement human feedback loops: Design human annotation protocols and quality control mechanisms to incorporate human judgment into the evaluation process, especially for subjective tasks.

Analyze model behavior: Conduct in‑depth analysis to understand AI model performance, identify weaknesses, and pinpoint failure modes.

Build production systems: Extend or integrate external evaluation tools into production environments, creating dashboards, alerts, and observability tooling to monitor models after deployment.

Manage golden datasets: Collaborate with domain experts to curate and manage high‑quality "Golden Question‑Answer‑Context" datasets essential for ground‑truth RAG evaluation.

Optimize prompts and systems: Translate evaluation results into clear, actionable recommendations for engineers to optimize LLM integration, prompt templates, and data chunking strategies.

Collaborate across teams: Work closely with product managers and software engineers to ensure evaluation methodologies align with business goals and to communicate technical findings to stakeholders.
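
To ground the bullets above, here is a minimal, illustrative harness of the kind this role would own: a golden question‑answer‑context dataset driving scripted retrieval and generation runs that produce aggregate scores. It is a sketch under assumptions, not nTop's actual framework; the retrieve and generate callables and the metric names are hypothetical stand‑ins.

"""Illustrative golden-dataset evaluation harness (a sketch, not nTop's code).

`retrieve` and `generate` are hypothetical stand-ins for the production
RAG pipeline under test.
"""
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenExample:
    question: str
    gold_answer: str
    gold_context_id: str  # ID of the document chunk that grounds the answer


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, a cheap proxy for answer accuracy."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = len(set(pred) & set(ref))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(
    dataset: List[GoldenExample],
    retrieve: Callable[[str], List[str]],       # question -> ranked chunk IDs
    generate: Callable[[str, List[str]], str],  # (question, chunks) -> answer
    k: int = 5,
) -> dict:
    """Run retrieval and generation over the golden set; aggregate metrics."""
    recall_hits, f1_scores = 0, []
    for ex in dataset:
        chunk_ids = retrieve(ex.question)[:k]
        recall_hits += ex.gold_context_id in chunk_ids  # retrieval recall@k
        answer = generate(ex.question, chunk_ids)
        f1_scores.append(token_f1(answer, ex.gold_answer))
    n = len(dataset)
    return {
        "retrieval_recall@k": recall_hits / n,
        "answer_token_f1": sum(f1_scores) / n,
        "n_examples": n,
    }

In practice the metric set would extend to groundedness, robustness, and safety checks, but the shape, a curated golden set driving repeatable runs that emit aggregate scores, is the core of the work described here.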

Required Experience

2-3 years of professional experience in machine learning, MLOps, or software quality assurance, specifically focused on modern LLM applications.

Experience building, testing, or evaluating production‑grade RAG systems or other complex information retrieval/NLP systems.

Containerization & Infrastructure: Proven experience with Docker for containerizing applications, setting up consistent evaluation environments, and managing dependencies.

Programming & Tools: Expert proficiency in Python and experience with NLP/ML libraries and data processing tools.

MLOps and CI/CD: Practical experience integrating evaluation steps into automated testing and deployment pipelines for LLM‑based applications; one possible shape of such a gate is sketched below.
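
One plausible shape for that CI/CD integration, continuing the hypothetical harness sketched above: an evaluation gate written as an ordinary test that fails the pipeline when aggregate scores fall below agreed floors. The thresholds and file name are placeholders a team would calibrate against baseline runs.

"""Illustrative CI gate: fail the build if evaluation metrics regress.

Assumes the hypothetical `evaluate` harness sketched earlier has already
run and written its scores to disk; thresholds are placeholders.
"""
import json
import pathlib

THRESHOLDS = {"retrieval_recall@k": 0.85, "answer_token_f1": 0.60}
RESULTS = pathlib.Path("eval_results.json")  # written by the eval job


def test_eval_metrics_meet_thresholds():
    scores = json.loads(RESULTS.read_text())
    failures = {
        name: (scores[name], floor)
        for name, floor in THRESHOLDS.items()
        if scores[name] < floor
    }
    assert not failures, f"Metrics below threshold: {failures}"

Run under pytest as a deployment‑pipeline step, a gate like this blocks any release whose evaluation scores regress, which is one common way evaluation is wired into CI/CD.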

Preferred Experience

Domain Knowledge: Experience with AI/ML applications in CAD, simulation, engineering design, optimization, or manufacturing.

Search Relevance: Experience with classic information retrieval metrics, search engine optimization, or search relevance engineering.

Cloud Infrastructure: Experience deploying and scaling RAG components and evaluation pipelines using container orchestration tools like Kubernetes on cloud platforms (e.g., AWS, Azure, GCP).

LLM‑as‑a‑Judge: Experience designing and validating LLM‑based evaluation metrics for subjective quality assessment; see the sketch after this list.

Data Engineering: Familiarity with ETL processes specifically for unstructured document ingestion and metadata enrichment.
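
For the LLM‑as‑a‑Judge item above, a minimal sketch of the pattern: a rubric prompt scored by a judge model, with the reply parsed and range‑checked. call_judge_model is a hypothetical stand‑in for whatever LLM client the stack uses, and a judge like this is only trusted after its scores have been validated against human annotations.

"""Illustrative LLM-as-a-judge scorer (pattern sketch, not a specific API).

`call_judge_model` is a hypothetical stand-in for an LLM client.
"""
from typing import Callable

RUBRIC = """You are grading a RAG answer. Score groundedness from 1 to 5:
5 = every claim is supported by the provided context; 1 = unsupported.
Reply with only the integer score.

Context: {context}
Question: {question}
Answer: {answer}"""


def judge_groundedness(
    question: str,
    context: str,
    answer: str,
    call_judge_model: Callable[[str], str],
) -> int:
    prompt = RUBRIC.format(context=context, question=question, answer=answer)
    reply = call_judge_model(prompt).strip()
    score = int(reply)  # production code would parse defensively and retry
    if not 1 <= score <= 5:
        raise ValueError(f"Judge returned out-of-range score: {reply!r}")
    return score

Validating such a judge usually means measuring agreement between its scores and human labels (for example with a correlation or kappa statistic) before it is allowed to stand in for human review.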

Benefits

Outstanding PTO and leave policy

ISO options

Health, dental and vision plans

401k with generous matching

Annual stipend for continued career learning/development

Commuter benefits for NY‑based hires

Compensation

Depending on experience, the annual salary ranges from $145,000 to $190,000, plus options.

Seniority Level: Mid‑Senior level

Employment Type: Full‑time

Job Function: Engineering and Information Technology

Industry: Software Development
