Oxmiq Labs
Own microarchitecture and RTL for high-throughput GPU pipeline blocks. You’ll translate product goals into clear specs, deliver timing-clean RTL, and partner across verification, physical design, and—critically—the GPU architecture team to land measurable PPA wins.
Responsibilities
Microarchitecture: Define pipeline stages, flow control, queues/buffers, and interfaces; write concise design specs and lead reviews.
RTL Design & PPA: Implement clean, synthesizable SystemVerilog; drive performance/power/area optimizations (datapaths, arbitration, backpressure, gating).
Architecture Collaboration: Work day-to-day with the architecture team to refine requirements, align on performance targets, and iterate on uArch choices with data.
Verification Partnership: Build unit tests, create coverage plans, and author SVA; collaborate with UVM/formal to close corner cases.
Quality & Sign-off: Run lint/CDC/RDC; support synthesis/STA and timing convergence; engage with PD/DFT for constraints and test.
Bring-up & Debug: Support emulation/FPGA and silicon; instrument counters, analyze traces, and root-cause issues end-to-end.
Communication & Teamwork: Communicate trade-offs clearly across architecture, software, and PD; mentor peers and contribute to cross-IP integration.
Qualifications
5+ years of industry experience on desktop, mobile, or data center GPUs with real, shipped project ownership.
Proficient in RTL design (SystemVerilog) and PPA optimization across performance, power, and area.
Team player with a strong understanding of overall GPU architecture and microarchitecture (SIMT/SIMD execution, scheduling and flow control, memory hierarchy).
Hands‑on first: able to build unit tests, drive coverage‑based verification (functional/code), and write robust SVA.
Depth in at least one of the following domains:
  Instruction Scheduler (warp/wavefront issuing, fairness, QoS)
  Job Scheduler / Command Submission
  L1/L2 Cache Design (coherency, miss handling, prefetch)
  Command Processor (front‑end, MMIO, context management)
  Tensor Core Design (matrix/tensor datapaths, mixed precision)
Required Skills
Nice to Have:
Experience with ray tracing blocks, texture/sampler, ROP/blend, or MMU/TLB.
Performance modeling, perf counter design, and trace analysis.
EDA fluency: VCS/Questa, Verdi, Jasper/IFV, DC/Genus, PrimeTime/Tempus; emulation (Palladium/Veloce) or FPGA prototypes.
Collaboration with compiler/LLVM and driver/runtime teams.
What Success Looks Like (6–12 Months):
Tapeout‑quality RTL for one or more pipeline blocks with signed‑off PPA.
Coverage closure against a clear plan (≥ target functional/code coverage) with SVA‑backed correctness.
Demonstrated perf/power gains on target workloads vs. baseline.
Seniority level
Mid‑Senior level
Employment type
Full‑time
Job function
Technology, Information and Internet