GPU Kernel Developer – AI/ML

Remote, USA
Posted Jun 13, 2026
Full-time

Role Summary

We are seeking expert-level GPU Software Engineers to support a high-visibility platform initiative within the Maya program, focused on building software tooling on top of a custom compiler and SDK.

The role involves developing, optimizing, and porting GPU kernels and AI workloads to a specialized hardware platform.

This is a critical and time-sensitive engagement with immediate onboarding expectations and long-term roadmap alignment (~18 months).

Key Responsibilities

• Develop GPU kernels for specialized hardware platforms using PyTorch/Triton frameworks

• Build software solutions leveraging custom compiler and SDK capabilities

• Design and implement kernel-level optimizations to control hardware execution behavior

• Port open-source AI/ML models to custom SDK environments

• Port and adapt high-performance computing benchmarks and stress workloads such as:
  • Linpack (High Performance Linpack)
  • BERT/benchmark-style workloads (referred to as “Babu bench”)

• Develop stress-testing and validation workloads aligned with hardware behavior and platform validation

• Support testing and stress testing of current and next-generation hardware platforms

• Collaborate closely with platform architects and compiler teams to enhance system capabilities

Core Technical Skills (Must-Have)

Programming & Frameworks

• Python

• C/C++ (systems-level programming)

• PyTorch

• Triton (Triton language / kernel development)

GPU & Systems Expertise

• GPU kernel development (mandatory and critical)

• Strong understanding of GPU architecture and compute optimization

• Experience with compiler-based optimizations / runtime execution layers

• Experience with custom SDKs or hardware abstraction layers

Performance & Workloads

• Experience in:
  • GEMM kernel development (matrix multiplication kernels)
  • Porting ML models to new hardware platforms
  • Performance tuning and stress testing at system level

Nice-to-Have Skills

• Experience working with custom silicon / hardware platforms

• Exposure to high-performance computing (HPC) workloads

• Familiarity with:
  • Linpack benchmarks
  • AI workload benchmarking tools

• Experience in compiler optimization ecosystems

Engagement Model & Structure

• Number of roles: 3 developers (initial hiring may start with 2)

• Location flexibility: Onsite / Offshore / Hybrid mix allowed

• Timeline: Immediate start required

• Duration: ~18-month program with phased platform evolution

Key Differentiators (Critical Expectation)

• This is NOT a DevOps / support / debugging role

• Requires deep hands-on engineering expertise in:
  • Kernel programming
  • GPU workloads
  • ML framework internals

• Candidates must demonstrate build-level competence, not just theoretical knowledge

Apply to this job
