Network/Infrastructure Engineer - Remote

Remote, USA
Posted Jun 14, 2026
Full-time

Company Overview:

We are a pioneering Infrastructure-as-a-Service (IaaS) company, focusing on delivering High-Performance Computing (HPC) solutions. Our cutting-edge data centers form the core of our operations, empowering us to offer unmatched computational resources to our global clientele. In line with our growth and the expansion of our services, we are on the lookout for a skilled and innovative Network/Infrastructure Engineer to strengthen our team.

Position Summary:

The Network/Infrastructure Engineer is pivotal in designing, implementing, and optimizing the network and compute infrastructure that powers our high-performance computing environments. This role encompasses network architecture design, operational management of complex BGP environments, HPC cluster optimization, and performance benchmarking. The successful applicant will collaborate closely with NVIDIA, deployment teams, and cross-functional engineering groups to ensure our infrastructure delivers exceptional performance and reliability. Travel to data centers within the US may occasionally be required to support network deployments, troubleshooting, or performance optimization initiatives.

Key Responsibilities:

Network Design & Architecture:

Design physical and logical network topologies for high-performance computing environments supporting large-scale workloads

Maintain IP address management (IPAM) schemes ensuring efficient allocation and documentation

Create comprehensive network diagrams and technical documentation for current and future infrastructure

Collaborate with NVIDIA on Reference Architecture standards to ensure adherence to best practices and optimal configurations

Evaluate and recommend network technologies and solutions to meet evolving business requirements

Network Operations:

Configure and maintain BGP peering sessions with ISPs, partners, and internal autonomous systems

Monitor network health using observability tools, identifying and resolving performance bottlenecks

Respond to network incidents and perform advanced troubleshooting to minimize downtime

Coordinate IP block procurement and assignment, working with RIRs and transit providers

Maintain network security posture and implement changes following established protocols

Participate in on-call rotation for critical network incidents

Network Projects:

Develop detailed network BOMs (Bills of Materials) for new deployments in collaboration with deployment teams

Test and validate network configurations in lab environments prior to production deployment

Evaluate driver upgrades and perform compatibility testing across network hardware and software stacks

Design and implement network enhancements to improve performance, reliability, and scalability

Execute comprehensive network performance benchmarking using industry-standard tools and methodologies

Document project outcomes and create knowledge base articles for operational teams

HPC Cluster Management:

Optimize cluster performance and utilization through tuning of network fabric, storage, and compute resources

Test and validate deployment profiles for various HPC workloads and use cases

Configure and maintain high-speed interconnects (InfiniBand, RoCE) for low-latency communication

Work with infrastructure teams to ensure proper integration of compute, storage, and network components

Performance & Optimization:

Conduct rigorous benchmarking and performance analysis of HPC infrastructure using tools such as IOR, NCCL, and MLPerf

Test driver and firmware upgrades in an HPC context, validating compatibility and performance impact

Troubleshoot complex compute node and interconnect issues affecting application performance

Document HPC-specific configurations and tuning parameters for various workload types

Identify and implement optimizations for network throughput, latency, and job completion times

Collaboration and Documentation:

Work closely with deployment engineers to ensure successful network implementation

Collaborate with infrastructure operations teams on incident response and problem resolution

Maintain comprehensive technical documentation including network diagrams, runbooks, and configuration standards

Participate in architecture review sessions and contribute to infrastructure planning

Mentor junior team members on networking concepts and HPC technologies

Safety and Compliance:

Adhere to strict data center safety protocols and operational standards during all on-site activities

Follow security best practices for network configuration and access control

Participate in regular safety training and briefings

Qualifications:

Bachelor's degree in Computer Science, Computer Engineering, Information Technology, or a related field preferred

3-5 years of experience in network engineering, with emphasis on large-scale data center or HPC environments

Expert-level knowledge of networking protocols including BGP, OSPF, VLANs, and routing fundamentals

Strong hands-on experience with enterprise network equipment from vendors such as Cisco, Arista, NVIDIA (Mellanox), or Juniper

Proficiency with high-speed interconnect technologies including InfiniBand, Ethernet RDMA (RoCE), and related protocols

Experience with network monitoring and observability tools (Prometheus, Grafana, Nagios, or similar)

Deep understanding of IP addressing, subnetting, and IP address management (IPAM)

Demonstrated experience with HPC cluster architectures and job scheduling systems (Slurm, PBS, or similar)

Strong Linux system administration skills including shell scripting and automation

Experience with network performance testing tools and benchmarking methodologies

Familiarity with NVIDIA GPU computing architectures and networking solutions preferred

Knowledge of software-defined networking (SDN) concepts and implementation

Experience with configuration management tools (Ansible, Terraform, or similar) preferred

Strong analytical and troubleshooting skills with a systematic problem-solving approach

Excellent documentation skills with attention to detail

Effective communication skills, both written and verbal, with the ability to explain complex technical concepts to diverse audiences

Self-motivated, with the ability to work independently and manage multiple projects simultaneously

Availability to participate in on-call rotation and travel occasionally to data center locations as required

Preferred Certifications:

CCNP, CCIE, or equivalent networking certifications

NVIDIA networking certifications

Relevant cloud or data center certifications
