
Nathan Digital
Nathan Digital builds hyper-customized ERP solutions for clients in more than 80 industries across the MENA region, supporting SMBs, multinationals, and government entities. We are looking for a Site Reliability Engineer with 3–5 years of experience to ensure the reliability, performance, and scalability of our cloud infrastructure and services. This role is perfect for someone passionate about monitoring, automation, and building resilient, observable, and highly available systems.

### Key Responsibilities
* Design, implement, and maintain CI/CD pipelines to deliver software reliably and efficiently. Containerize applications using Docker and manage deployments on AWS (ECS, EC2, ALB).
* Monitor system performance, create dashboards, configure alerts, and analyze logs to proactively identify and resolve issues.
* Manage infrastructure for scalability, cost optimization, and high availability. Automate operational workflows using Python and Bash to enhance efficiency and reliability.
* Lead incident response, conduct root cause analysis, and implement improvements to prevent future issues.
* Collaborate closely with developers to optimize deployment processes and application instrumentation.
* Plan and execute disaster recovery strategies, including backups, failover mechanisms, and resilience testing.

### Requirements and Qualifications
* BA/BSc/HND degree in a relevant ICT field.
* 3–5 years in DevOps, Site Reliability, or cloud operations roles.
* Strong AWS experience (ECS, EC2, ALB) and cloud infrastructure management.
* Hands-on expertise with monitoring and observability tools (Prometheus, Grafana, Loki/ELK).
* Proficiency with Docker, container orchestration, and scripting (Python and Bash).
* Strong problem-solving skills and the ability to troubleshoot complex production issues.

### Nice to Have
* Experience with Infrastructure as Code (Terraform).
* Exposure to Kubernetes (EKS) environments.
* Familiarity with MongoDB Atlas operations.
* Experience with cloud cost optimization and performance tuning.

### Success Indicators
* Systems are highly reliable, scalable, and easy to operate.
* Clear visibility into system health and performance across all services.
* Reduced incident frequency and faster recovery times.
* Deployment and operational workflows are automated and efficient.

### How to Apply
Interested and qualified candidates should apply using the Apply Now button on the portal or visit the application page directly: https://www.myjobmag.co.ke/job-application/1198289