Principal Site Reliability Engineer

Remote, USA

Posted Jun 13, 2026

Full-time

Arcadia is dedicated to happier, healthier days for all. We believe that there is a better healthcare world – one powered by data. Our platform transforms complex, diverse data into a unified foundation for health, helping organizations deliver better care, boost revenue, and lower costs.

We’re a team of fiercely driven individuals committed to making healthcare more sustainable—and we’re looking for passionate people to help us get there.

For more information, visit arcadia.io

Why This Role is Important to Arcadia

Love building reliable systems, and want to make a difference?

Arcadia’s customers rely on us to securely process and deliver high-value healthcare insights. Reliability, availability, performance, and security are foundational to trust—especially when systems support critical workflows and handle PHI. As a Principal Site Reliability Engineer, you’ll set reliability strategy across teams, drive cross-cutting platform improvements, and ensure we can scale delivery without scaling operational burden.

What Success Looks Like

In 3 months

Build deep context on Arcadia’s platform, production risks, and operational practices. Participate in on-call/incident response and quickly improve signal quality for at least one critical domain (dashboards, alerts, traces, runbooks). Identify a high-leverage reliability initiative and align stakeholders on scope, success metrics, and milestones.

In 6 months

Establish SLOs/error budgets for key customer journeys, drive operational readiness standards for launches, and lead remediation for recurring incidents with measurable reductions in customer impact and MTTR. Deliver major toil-reduction improvements via automation and self-service workflows.

In 12 months

Own and execute a reliability program with cross-org impact (e.g., GitOps delivery guardrails, observability platform evolution, resilience/DR improvements, or secure infrastructure controls). Influence architecture decisions, establish org-wide operational standards, and mentor Staff engineers—raising the reliability and security bar across Arcadia.

What You'll Be Doing

Act as the technical leader for reliability for one or more domains; set direction and standards while remaining hands-on where it matters most

Drive reliability strategy across critical services: define SLOs/SLIs, error budgets, and reliability KPIs aligned to customer journeys and outcomes

Own incident response maturity: lead complex incidents, improve incident command practices, and ensure high-quality RCAs with prioritized, tracked remediation

Architect and implement automation to reduce toil and risk: runbook automation, self-service tools, and safe operational workflows (Python + Argo Workflows)

Advance GitOps delivery practices using Argo CD: promotion strategies, progressive delivery/canaries, and guardrails that reduce deploy risk

Scale infrastructure management with Crossplane and Terraform: reusable patterns, policy controls, and paved roads for teams

Lead operational readiness and reliability reviews for new features/architectural changes; reinforce non-functional requirements (availability, latency, security, cost)

Improve performance and cost efficiency through capacity planning, load testing, right-sizing, and architecture recommendations across AWS services

Champion infrastructure security best practices for environments that handle PHI (least privilege, secrets management, auditability, and defense-in-depth)

Mentor Staff and Senior engineers through design reviews, code reviews, pairing, and documentation; raise reliability standards across teams

What You'll Bring

8+ years of experience in SRE, platform engineering, systems engineering, or related roles operating production services at scale

Demonstrated principal-level impact: leading cross-team initiatives, influencing architecture decisions, and driving sustained improvements in reliability and operations

Expertise in Kubernetes operations and troubleshooting, including safe rollout/rollback patterns, workload debugging, and operational guardrails

Strong GitOps experience with Argo CD
