Cloud Infrastructure Engineer

Remote, USA
Posted Jun 12, 2026
Full-time

About the Role

We're seeking a Senior Cloud Infrastructure Engineer for a 3-month contract engagement to join our Infrastructure team and take ownership of operational excellence and SRE toil work. This is a remote, hands-on, high-velocity role where you'll keep the lights on and reduce operational burden from day one.

This contract position exists to free up our existing team to focus on roadmap initiatives. By taking over day-to-day operational work and SRE toil, you'll enable one of our current engineers to tackle strategic projects. Your success means the platform runs smoothly while the team makes forward progress on critical initiatives.

You'll bring deep operational expertise to manage production systems, respond to operational needs, and—critically—build systems and automation that reduce toil over time. This role is ideal for an experienced SRE or infrastructure engineer who thrives on operational work, can quickly understand production systems, and naturally improves everything they touch.

Our platform powers genomics and laboratory workflows for customers in highly regulated environments. You'll work with modern infrastructure tooling (HashiCorp stack, AWS, Kubernetes patterns) while ensuring we meet the reliability, security, and compliance requirements our customers depend on.

What You'll Do

Operational Excellence & SRE Work (60%)

Keep the lights on: Monitor, respond to, and resolve production incidents and operational issues

Handle toil work: Manage routine operational tasks that currently consume team capacity (deployments, configuration changes, access management, maintenance windows)

Participate in on-call rotation: Share responsibility for after-hours production support

Respond to support escalations: Work with support and development teams to troubleshoot and resolve platform issues

Manage production changes: Execute and validate infrastructure changes in production environments

Maintain operational runbooks: Update and improve documentation for operational procedures

Perform system maintenance: Handle patches, upgrades, certificate renewals, and other recurring operational tasks

Ensure service reliability: Monitor system health, respond to alerts, and maintain SLAs

Toil Reduction & Automation (30%)

Identify automation opportunities: Spot repetitive manual work and build automation to eliminate it

Improve operational tooling: Create scripts, utilities, and self-service tools to reduce operational burden

Enhance monitoring and alerting: Improve observability to catch issues before they become incidents

Streamline deployment processes: Reduce friction and manual steps in release and deployment workflows

Build self-service capabilities: Enable developers to handle routine tasks without infrastructure team involvement

Implement infrastructure-as-code: Convert manual procedures into automated, repeatable infrastructure code (Terraform)

Document system improvements: Leave behind improved runbooks, automation, and processes

Measure and track toil: Help quantify operational burden and demonstrate reduction over time

Collaboration & Knowledge Transfer (10%)

Enable roadmap progress: By handling operational work, free up permanent team members for strategic initiatives

Collaborate with development teams: Support their infrastructure needs and unblock their work

Document tribal knowledge: Capture operational knowledge and procedures that exist only in people's heads

Conduct handoffs: Provide clear documentation and knowledge transfer for systems and automation you build

Participate in team rituals: Standups, retrospectives, and planning to stay aligned with team priorities

What We're Looking For

Required

5-8 years of experience in infrastructure, platform, SRE, or DevOps engineering

Strong operational background: Experience managing production systems and handling incidents

Proven toil reduction skills: Track record of identifying repetitive work and automating it away

Strong expertise with cloud infrastructure (AWS strongly preferred)

Proficiency with infrastructure-as-code (Terraform required)

Experience with container orchestration (Kubernetes, Nomad, or similar)

Experience with service mesh and service discovery (Consul, Istio, or similar)

Experience with secrets management (Vault, Secrets Manager, or similar)

Strong understanding of monitoring, alerting, and observability

Comfortable with on-call work: Experience with incident response and production support

Proven ability to onboard quickly and become productive in new environments

Strong troubleshooting skills: Can diagnose complex system issues under pressure

Self-directed work style: Minimal supervision required for operational work

Bias for automation: Natural instinct to eliminate manual work

Preferred

Experience with HashiCorp tooling (Terraform, Nomad, Consul, Vault)

Experience in healthcare, life sciences, or other regulated industries

Familiarity with compliance frameworks (HIPAA, HITRUST, SOC 2, ISO 27001)

Experience with observability platforms (Datadog, Grafana, Prometheus)

Experience supporting Java/Spring applications

Background in genomics, bioinformatics, or laboratory systems

Experience with GitOps workflows and CI/CD automation

Previous contract or consulting experience with rapid onboarding

Experience quantifying and measuring toil (e.g., SLO/SLI frameworks)

What Success Looks Like

First 2 Weeks

Complete onboarding and gain access to all systems

Shadow on-call rotation and understand incident response procedures

Take ownership of routine operational tasks (deployments, configuration changes, monitoring)

Build relationships with development and support teams

Begin handling operational requests and support escalations independently

Identify your first 2-3 toil reduction opportunities

First Month

Fully integrated into operational workflows—handling day-to-day platform operations with minimal guidance

Successfully participating in on-call rotation

Delivered 1-2 automation improvements that reduce manual work

Team members report they have more time for roadmap work due to your operational coverage

Demonstrated ability to troubleshoot and resolve production issues independently

Improved at least one operational runbook or procedure

End of Contract (3 Months)

Platform reliability maintained or improved: No degradation in service quality or uptime

Toil measurably reduced: Team can point to 3-5 significant automation or process improvements you delivered

Roadmap progress enabled: At least one permanent team member successfully completed a strategic initiative because you freed up their capacity

Operational systems improved: Left behind better monitoring, alerting, documentation, and automation

Knowledge transfer complete: Documented all operational improvements and handed off systems/automation cleanly

Team capacity increased: Reduced the time the permanent team spends on operational toil by a measurable amount (target: 20-30% reduction)

Optionally: Position identified for contract extension if operational coverage continues to be valuable
