The Senior Site Reliability Engineer is a technical leadership role responsible for designing, implementing, and maintaining highly available, scalable, and secure on-premise infrastructure for banking applications, including the Mobile Banking and Internet Banking platforms. The role leads SRE initiatives, mentors junior engineers, drives continuous improvement in production support, and directs the observability strategy built on OpenShift, Kubernetes, Prometheus, Grafana, and the ELK Stack in the on-premise data center.
Key Responsibilities
- Architect highly available and scalable OpenShift/Kubernetes infrastructure for banking applications on on-premise servers.
- Lead and implement a comprehensive monitoring and observability strategy using Prometheus and Grafana.
- Design and oversee centralized logging infrastructure using the ELK Stack (Elasticsearch, Logstash, Kibana).
- Lead the implementation of SRE best practices and the adoption of production support standards across teams.
- Mentor and coach junior SRE and DevOps engineers on OpenShift, Kubernetes, monitoring, and production support.
- Define and implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) with measurable metrics.
- Lead incident response strategy and post-incident reviews, and drive continuous improvement in production stability.
- Architect and implement advanced alerting, monitoring dashboards, and visualization strategies using Prometheus and Grafana.
- Design automation frameworks and tools to reduce operational toil and improve production efficiency.
- Lead OpenShift/Kubernetes cluster upgrades, security patching, and on-premise infrastructure modernization.
- Establish production support procedures, on-call rotation policies, and escalation frameworks.
- Optimize system performance, cost, and resource utilization across containerized on-premise infrastructure.
- Conduct capacity planning, performance optimization, and infrastructure scaling initiatives.
- Lead technical architecture reviews and infrastructure design decisions for banking applications.
- Manage on-premise data center resources and infrastructure planning.
- Participate in the 24/7 on-call rotation and in escalations for critical production incidents.
- Maintain compliance, security hardening, and disaster recovery procedures for financial systems.
Qualifications and Experience
- BSc in Computer Science, Information Technology, Software Engineering, or a related field.
- 5+ years of hands-on SRE, DevOps, or Production Engineering experience.
- 3+ years of experience leading SRE teams or managing production support operations.
- 3+ years of hands-on experience managing on-premise OpenShift and Kubernetes infrastructure.
- Expert-level experience with Prometheus for monitoring and alerting in production.
- Expert-level experience with Grafana for creating comprehensive monitoring dashboards.
- Advanced experience with the ELK Stack (Elasticsearch, Logstash, Kibana) for logging and log analysis.
- Proven experience designing and scaling production systems for high-traffic banking applications.
- Deep expertise in Linux/Unix system administration and container networking.
- Advanced knowledge of CI/CD automation and deployment strategies.
- Hands-on experience with on-premise database management, tuning, and optimization.
- Strong experience with infrastructure automation and Infrastructure as Code.
- Proven 24/7 production support experience in mission-critical environments.
- Experience managing on-premise data center infrastructure.
- Proven leadership skills and ability to mentor junior engineers.
- Excellent communication skills and ability to present to executive stakeholders.
- Experience in the financial services or banking sector is highly preferred.