NVIDIA
NVIDIA is hiring: Principal Software Engineer, Profiling Services in Austin
NVIDIA, Austin, TX, US
Senior Software Engineer, Profiling Services
Join NVIDIA as a Senior Software Engineer on the Developer Tools Always-On Profiling (AON) team, where you'll be instrumental in designing, implementing, and leading our GPU performance analysis service for Machine Learning workloads.
What You'll Be Doing
- Architect and build scalable systems for the AON profiling service's core components, mastering inter-process communication, memory management, and low-overhead architectures.
- Promote high standards in software development, including design patterns, concurrency, parallelism, and advanced debugging for asynchronous systems.
- Lead, mentor, and provide code reviews, shaping technical roadmaps and solving complex issues across the AON project.
- Translate user needs into requirements, design documents, and end-to-end feature development across applications, drivers, and hardware abstraction layers.
- Collaborate effectively with internal and external teams, communicating and coordinating across the broader profiling and ML ecosystem.
- BS or MS degree or equivalent experience in Computer Engineering, Computer Science, or a related field.
- 8+ years of software development in C, C++, and Python.
- 12+ years in system software design, operating systems fundamentals, computer architectures, and performance analysis.
- Strong interpersonal, verbal, and written communication skills.
- Expertise in profiling technologies (sampling, tracing), overhead analysis, and working with diverse profiling data.
- In-depth CUDA API, runtime, and GPU architecture knowledge.
- Familiarity with ML frameworks such as PyTorch and JAX and performance analysis for AI training/inference.
- Experience developing and debugging complex multi-layered systems, including drivers.
- Proficiency in designing APIs and interfaces for profiling tools.
- History of simplifying ill-defined problems and leading teams to implement solutions.
- Pioneering low-overhead profiling systems in complex distributed environments.
- Deep understanding of PyTorch internals and CUDA usage.
- Proficiency in analyzing profiling data and translating insights.
- Skill in translating customer needs into actionable use cases.
- Strong understanding of system security principles.