Nathan Digital is seeking a Site Reliability Engineer with 3–5 years of experience to ensure the reliability, performance, and scalability of our cloud infrastructure and services. Nathan Digital builds hyper-customized ERP solutions for clients in more than 80 industries across the MENA region, supporting SMBs, multinationals, and government entities. This role suits someone passionate about monitoring, automation, and building resilient, observable, and highly available systems.
Key Responsibilities
Design, implement, and maintain CI/CD pipelines to deliver software reliably and efficiently.
Containerize applications using Docker and manage deployments on AWS (ECS, EC2, ALB).
Monitor system performance, create dashboards, configure alerts, and analyze logs to proactively identify and resolve issues.
Manage infrastructure for scalability, cost optimization, and high availability.
Lead incident response, conduct root cause analysis, and implement improvements to prevent future issues.
Automate operational workflows using Python and Bash to enhance efficiency and reliability.
Collaborate closely with developers to optimize deployment processes and application instrumentation.
Plan and execute disaster recovery strategies, including backups, failover mechanisms, and resilience testing.
What We Are Looking For
3–5 years of experience in DevOps, Site Reliability, or cloud operations roles.
BA/BSc/HND qualification in a relevant field.
Strong AWS experience (ECS, EC2, ALB) and cloud infrastructure management.
Hands-on expertise with monitoring and observability tools (Prometheus, Grafana, Loki/ELK).
Experience building and maintaining CI/CD pipelines.
Proficiency with Docker and container orchestration.
Skilled in scripting and automation using Python and Bash.
Strong problem-solving skills and the ability to troubleshoot complex production issues.
Nice to Have
Experience with Infrastructure as Code (Terraform).
Exposure to Kubernetes (EKS) environments.
Familiarity with MongoDB Atlas operations.
Experience with cloud cost optimization and performance tuning.
What Success Looks Like
Systems are highly reliable, scalable, and easy to operate.
Clear visibility into system health and performance across all services.
Reduced incident frequency and faster recovery times.
Deployment and operational workflows are automated and efficient.
How to Apply
Interested and qualified candidates should apply online via the MyJobMag application portal at https://www.myjobmag.co.ke/job-application/1198289 and follow the 'Apply Now' instructions provided there.