Infrastructure/GPU Engineer

Remote, USA
Posted Jun 12, 2026
Full-time

Cognizant is seeking a highly skilled, hands-on Infrastructure Engineer with proven experience in the physical and technical deployment of environments optimized for AI and machine learning workloads. This role focuses on NVIDIA DGX or similar systems, GPU-accelerated compute clusters, high-speed networking, and scalable storage solutions. The ideal candidate will have deep expertise in infrastructure design, deployment, workload orchestration, and performance optimization in enterprise environments.

This is a remote role in the US. The salary range for this role is $99,000 to $116,000, depending on the candidate's skills and qualifications. Applications will be accepted until 10/21/2025.

Key Responsibilities

System Design & Deployment

  • Help right-size GPU investments.

  • Architect and deploy NVIDIA DGX systems and GPU-based compute clusters.

  • Design and implement scalable parallel filesystems (e.g., Lustre, BeeGFS, GPFS).

  • Integrate high-speed interconnects using InfiniBand, RoCE, and RDMA.

  • Collaborate on rack planning and airflow optimization.

Cluster & Infrastructure Management

  • Configure and manage Slurm Workload Manager for job scheduling.

  • Deploy and maintain cluster orchestration tools.

  • Automate provisioning using PXE boot, Terraform, Redfish, and Kubernetes.

  • Perform firmware updates, BIOS/IPMI/BMC configuration, and OS provisioning.

  • Knowledge of Run.ai, ClearML, or a similar platform.

Networking & Performance Optimization

  • Design and validate network topologies including IPMI, internal/external networks, and InfiniBand fabrics.

  • Optimize RDMA and RoCE configurations for low-latency, high-throughput data transfers.

  • Conduct performance benchmarking using GPU-Burn, NCCL, and NVSM.

Monitoring & Troubleshooting

  • Implement system health checks and diagnostics across compute, storage, and network layers.

  • Troubleshoot hardware/software issues and ensure reliable infrastructure operation.

Required Skills & Qualifications

Technical Expertise

  • Deep understanding of NVIDIA DGX architecture, CUDA, and GPU compute.

  • Strong Linux system administration and shell scripting skills.

  • Experience with Slurm, parallel filesystems, and high-speed networking (InfiniBand/RDMA/RoCE).

  • Familiarity with containerization (Docker), orchestration (Kubernetes), and automation tools (Ansible, Redfish).

Preferred Qualifications

  • Experience with BBCM and DGX BasePOD/SuperPOD configuration.

  • Certifications from NVIDIA or an equivalent OEM.
