Site Reliability Engineer - Application Development (Kubernetes/Linux)
The Managed Services SRE is responsible for deploying, operating, and maintaining customer applications across Linux bare metal servers and Red Hat OpenShift (OCP) containerized platforms. This role focuses on application deployment, release management, reliability, and operational support in a live production environment.
The SRE will participate in on-call rotations and support night-time deployments, ensuring systems meet SLA requirements while continuously improving reliability and automation practices.
Key Responsibilities
- Deploy, manage, and maintain applications on Linux bare metal servers and OpenShift/Kubernetes clusters
- Execute CI/CD pipelines and ensure reliable, repeatable releases across hybrid environments
- Build and maintain observability for deployed applications using Prometheus, Grafana, and Zabbix
- Implement and maintain centralized logging solutions using Grafana Loki, OpenSearch/Elasticsearch, and Fluentd/Fluent Bit
- Develop automation scripts in Bash, Python, or JavaScript to streamline deployments and reduce operational toil
- Participate in incident response and troubleshoot application or platform issues in a live production environment
- Support night-time deployments and carry the pager on rotation