Canvas Medical
Applied AI Software Engineer
Base pay range: $300,000 - $400,000 per year. This range is provided by Canvas Medical; your actual pay will be based on your skills and experience. Talk with your recruiter to learn more.
Role
We're hiring an Applied AI Software Engineer to lead evaluations for agents in development and for the post-deployment fleet of agents operating in Canvas to automate work for our customers. You will help develop agents in Canvas using state-of-the-art foundation model inference and fine-tuning APIs along with our server-side SDK, which provides extensive tools and virtually all the context necessary for excellent agent performance. You'll be responsible for designing and running rigorous evaluation experiments that measure performance, safety and reliability across a wide variety of clinical, operational and financial use cases.
This role is ideal for someone with deep experience evaluating LLM‑based agents at scale. You’ll create high‑fidelity unit evals and end‑to‑end evaluations, define expert‑determined ground truth outcomes, and manage iterations across model variants, prompts, tool use and context window configurations. Your work will directly inform model selection, fine‑tuning and go/no‑go decisions for AI features used in production settings.
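To make the unit-eval idea concrete, here is a minimal sketch in Python of scoring agent outputs against expert-determined ground truth and comparing model variants on the same suite. Every name in it (run_agent, the model identifiers, the sample case) is an illustrative assumption, not part of Canvas's SDK or APIs.

```python
# Minimal sketch of a unit eval: score agent outputs against expert-labeled
# ground truth, then compare model variants on the same suite. All names
# here are illustrative placeholders, not Canvas APIs.
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str   # the task handed to the agent
    gold: str     # expert-determined ground-truth outcome

def run_agent(model: str, prompt: str) -> str:
    # Stand-in for a real agent call via a foundation model API with the
    # agent's tools and context attached; stubbed so the sketch runs.
    return "reschedule to next available slot"

def score(output: str, gold: str) -> float:
    # Exact match for illustration; production evals would use richer,
    # domain-specific correctness checks.
    return 1.0 if output.strip().lower() == gold.strip().lower() else 0.0

def evaluate(model: str, suite: list[EvalCase]) -> float:
    # Mean score of one model variant over an expert-labeled suite.
    return sum(score(run_agent(model, c.prompt), c.gold) for c in suite) / len(suite)

suite = [EvalCase(
    prompt="The patient asked to move their appointment; draft the action.",
    gold="reschedule to next available slot",
)]
for model in ["model-variant-a", "model-variant-b"]:
    print(model, evaluate(model, suite))
```

In practice, suites like this grow per workflow and feed experiment tracking, so each go/no-go decision traces back to a reproducible run.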
You'll collaborate with product, ML engineering and clinical informatics teams to ensure that Canvas's AI agents are not only capable but also trustworthy and robust under real-world healthcare constraints. You will also work with technical product marketers and developer advocates to help our developer community and the broader market understand the uniquely differentiated value of agents in Canvas.
Who You Are
You have extensive hands‑on experience evaluating LLM‑based systems, including multi‑agent architectures and prompt‑based pipelines
You are deeply familiar with foundation model APIs (e.g., OpenAI, Anthropic's Claude, Google's Gemini) and how to systematically benchmark agent performance using those models in applied settings
You care about correctness and reproducibility, and have built or contributed to frameworks for automated evals, annotation pipelines and experiment tracking
You bring structure to ambiguity and know how to define “correctness” in complex, nuanced domains
You are comfortable collaborating with engineering, product and clinical subject matter experts
You are not afraid of complexity and are energized by the rigor required in healthcare deployments
What You’ll Do
Design and execute large‑scale evaluation plans for LLM‑based agents performing clinical documentation, scheduling, billing, communications and general workflow automation tasks
Build end-to-end test harnesses that validate model behavior under different configurations (prompt templates, context sources, tool availability, etc.), as sketched after this list
Partner with clinicians to define accurate expected outcomes (a gold standard) for performance comparisons in domains of clinical consequence, and with other subject matter experts in non-clinical domains
Run and replicate experiments across multiple models, parameters and interaction types to determine optimal configurations
Deploy and maintain ongoing sampling for post‑deployment governance of agent fleets
Analyze results and summarize trade‑offs for product and engineering stakeholders, as well as technical stakeholders among our customers and the broader market
Take ownership of internal eval tooling and infrastructure, ensuring speed, rigor and reproducibility
Identify and recommend candidates for reinforcement fine-tuning or retrieval augmentation based on gaps surfaced in evals
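As a flavor of the harness and replication work above, here is a hedged sketch of a configuration sweep across models and prompt templates. All names are illustrative assumptions and do not reference Canvas's actual SDK or internal tooling.

```python
# Hedged sketch of a configuration-sweep harness: run every (model, prompt
# template) pair over a task set and log the full configuration with each
# output. All names are illustrative, not Canvas SDK or internal tooling.
import itertools
import json
import time

MODELS = ["model-variant-a", "model-variant-b"]  # candidate model variants
PROMPTS = {
    "terse": "Answer briefly: {task}",           # prompt templates under test
    "stepwise": "Think step by step, then answer: {task}",
}

def run_config(model: str, template: str, task: str) -> str:
    # Stand-in for invoking the agent with this configuration; a real
    # harness would also vary tool availability and context sources.
    return f"[{model}] " + template.format(task=task)

def sweep(tasks: list[str]) -> list[dict]:
    # Record enough metadata that any result can be reproduced from its row.
    records = []
    for model, (name, template) in itertools.product(MODELS, PROMPTS.items()):
        for task in tasks:
            records.append({
                "model": model,
                "prompt_template": name,
                "task": task,
                "output": run_config(model, template, task),
                "timestamp": time.time(),
            })
    return records

print(json.dumps(sweep(["Draft a visit summary"]), indent=2))
```

Logging the configuration alongside every output is what lets results be replicated and trade-offs summarized clearly for stakeholders.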
What Success Looks Like at 90 Days
An expanded set of robust evaluation suites exists for all major AI features currently in development and in production
We have well‑defined correctness criteria for each workflow and a reliable source of expert‑determined outcome objects
Product and engineering teams have integrated your evaluation tools into their daily workflows
Evaluation results are clearly documented and reproducible, enabling trust in the performance trajectory
You have effectively engaged your marketing counterparts to translate your work into key messages to the market and to Canvas customers
Qualifications
5+ years of experience in applied machine learning or AI engineering, with a focus on evaluation and benchmarking
Proficiency with foundation model APIs and experience orchestrating complex agent behaviors via prompts or tools
Experience designing and running high‑throughput evaluation pipelines, ideally including human‑in‑the‑loop or expert‑labeled benchmarks
Superlative Python engineering skills, plus familiarity with experiment management tools and data engineering toolsets, including SQL and database management
Familiarity with clinical or healthcare data is a strong plus
Experience with reinforcement fine‑tuning, model monitoring or RLHF is a plus
We encourage applicants to apply even if they do not meet every listed qualification
Benefits
Competitive Salary & Equity Package
Health Insurance
Home Office Stipend
401(k)
Paid Maternity/Paternity Leave (12 weeks)
Flexible/unlimited PTO
Equal Employment Opportunity
Canvas Medical provides equal employment opportunities to all employees and applicants for employment without regard to race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.