Senior Site Reliability Engineer
If you want to work at the intersection of serious engineering craft and meaningful patient outcomes, and to build practices that a growing team will rely on for years, this is the role for you.
- Define and maintain SLO and error budget frameworks across multiple services, working directly with product engineers to make reliability expectations concrete and actionable rather than aspirational.
- Design and evolve the observability architecture across the platform, ensuring the engineering team has genuine insight into system behaviour during the Django-to-Next.js migration and beyond.
- Identify systemic gaps in monitoring, alerting, and incident response before they surface as patient-facing incidents, and drive the work required to close them.
- Lead post-incident reviews that go beyond immediate fixes, producing changes to architecture, runbooks, on-call processes, or delivery practices that reduce the likelihood and impact of recurrence.
- Write infrastructure-as-code and automation that set the quality bar for the team, and review infrastructure contributions from product engineers and junior SREs with direct, specific feedback.
- Keep product engineering teams unblocked on reliability concerns by being a visible, proactive partner in delivery: attending design conversations, raising reliability risks early, and pushing back constructively when decisions create patient risk without a conscious trade-off.
- Continuously improve how the team approaches reliability, including on-call processes, reliability review checkpoints in the delivery cycle, and the quality of the documentation product engineers use to understand what is expected of their services.
- 6+ years in SRE, DevOps, or backend roles with production ownership.
- Experience operating, and improving the reliability of, distributed, customer-facing systems.
- Strong cloud and infrastructure-as-code experience (AWS, Terraform, or similar).
- Hands-on experience with SLOs, SLIs, and error budgets.
- Solid observability experience (metrics, logging, tracing).
- Experience leading incidents and post-incident reviews that drive systemic change.
- Strong scripting/programming skills (e.g. Python, Go, TypeScript).
- Ability to identify risks early and influence cross-team engineering decisions.
- Clear communication and documentation skills.
- Experience supporting system migrations or major architectural changes.
- Experience in regulated or high-availability environments.
- Experience improving on-call practices or mentoring engineers.
- Work from anywhere in Australia.
- A competitive salary and awesome benefits package.
- A supportive and positive work environment.
- Opportunities to grow and develop your career.
- Opportunity to transform lives through alternative medicine.