Network Reliability Engineer

Remote, USA
Posted Jun 13, 2026
Full-time

#HPC #AI #GPU #CLUSTERS

 

YOUR DAILY ROUTINE

- Build a large AI infrastructure with monitoring, diagnosis, and remediation of production incidents

- Troubleshoot high-impact production issues in collaboration with other engineering teams

- Participate in an on-call rotation to handle incidents and ensure service continuity

- Implement and maintain observability solutions to monitor AI infrastructure and application health

- Contribute to AI infrastructure lifecycle management across different environments and countries

- Promote and apply best practices in terms of stability, resiliency, scalability, and security

- Maintain clear technical documentation for tools and procedures

- Contribute to system and tool evolution based on production feedback

- Collaborate closely with development teams to ensure infrastructure readiness

- Participate in team rituals and knowledge-sharing initiatives

 

ABOUT YOU

 

🎯 SOFT SKILLS:

- Proactive and solution-oriented mindset

- Passion for automation and continuous improvement

- Strong collaboration and communication skills

- Ability to work independently and in a team

- Willingness to mentor and share knowledge

 

💻 HARD SKILLS:

- Experience with Go or Python 

- Strong scripting skills (Bash, Python)

- Hands-on experience with Linux systems (Ubuntu/Debian)

- Hands-on experience with GPU and HPC infrastructure (preferred)

- Knowledge of networking (TCP/IP, DNS, BGP, load-balancing, IPv6, etc.)

- Familiarity with monitoring and logging tools (Prometheus, Grafana, Elastic, etc.)

- Comfortable with Infrastructure-as-Code (Ansible, Salt, AWX, etc.)

- Experience managing relational databases (MariaDB)

- Understanding of CI/CD pipelines (GitLab)

- Comfortable with English (written and spoken)

 

200 zł - 250 zł an hour
