MLOps Support Team Lead at CloudFactory

Role Overview

As the MLOps Support Team Lead, you will own the day-to-day reliability, supportability, and operational maturity of CloudFactory's MLOps service. You will lead a global support team responsible for monitoring, triaging, and resolving issues across production ML systems, while driving improvements in observability, incident management, and service delivery.

You will work closely with Engineering, Platform Ops, and external partners to ensure AI/ML solutions are not only functional, but stable, measurable, and trusted in production. This role is critical in transitioning MLOps from reactive support to a proactive, scalable service capability.

Key Responsibilities

Service Ownership & Reliability

Own the operational performance of all production ML systems and pipelines.
Ensure reliability, availability, and supportability across client and internal MLOps workloads.
Establish and enforce SLAs, SLOs, and operational standards.
Act as the escalation point for major incidents and service degradation.

Team Leadership & Delivery

Lead a global MLOps Support team (L1/L2) across regions (Colombia, Kenya, Nepal).
Define shift patterns, on-call rotations, and coverage models.
Set clear expectations, performance metrics, and development plans.
Foster a strong operational culture focused on accountability and continuous improvement.

Incident Management & RCA

Own incident response processes, including triage, communication, and resolution.
Ensure high-quality Root Cause Analysis (RCA) and follow-through on corrective actions.
Drive reduction in repeat incidents through structured problem management.
Improve time to detect (TTD) and time to resolve (TTR) metrics.

Monitoring, Observability & MLOps Maturity

Drive implementation and evolution of monitoring across pipelines, data flows, infrastructure, compute, and model performance/drift.
Ensure visibility extends beyond system health to model accuracy, bias, and data integrity.
Partner with Engineering to improve instrumentation, logging, and alerting.

Support Model & Process Design

Define and evolve the MLOps support operating model.
Clearly establish boundaries between Support, Engineering, and external partners.
Build and maintain runbooks, playbooks, and escalation paths.
Standardize intake, triage, and resolution workflows (e.g., Slack, ticketing systems).

Stakeholder & Partner Management

Act as the primary operational interface for Engineering teams and Platform Operations.
Reduce reliance on individuals by formalizing ownership and knowledge sharing.
Provide clear communication during incidents and service updates.

Continuous Improvement & Scaling

Identify trends in incidents and operational inefficiencies.
Drive improvements in automation, alert quality, and self-healing capabilities.
Support onboarding of new MLOps projects and contribute to building MLOps as a scalable, repeatable service offering.

Reporting & Service Health

Define and track key operational metrics: incident volume, severity, SLA adherence, and system uptime.
Support regular service reviews and model health reporting.
Provide leadership visibility into risks, trends, and improvement areas.

Requirements and Qualifications

Must Have Skills (Required)

Proven experience in operations leadership, SRE, DevOps, or platform support environments.
Strong understanding of production support models, incident management, and escalation frameworks.
Experience leading or mentoring technical support or operations teams.
Working knowledge of ML systems in production, including pipelines, batch processing, model lifecycle, and common failure modes.
Strong analytical and troubleshooting skills in complex environments.
Experience with monitoring and observability tools.
Proficiency in SQL, plus Python or a scripting language such as Bash.
Ability to operate in a high-pressure, incident-driven environment.
Strong stakeholder management and communication skills.

Nice To Have Skills (Preferred)

Experience supporting AI/ML platforms at scale.
Familiarity with tools such as Databricks, MLflow, Grafana, Power BI, and New Relic.
Exposure to model monitoring (drift, bias, performance validation).
Experience with containerized environments (Docker / Kubernetes).
Background in building or scaling support functions from early-stage to maturity.

General Requirements

BA/BSc/HND degree.
Strong service ownership mindset — takes accountability for outcomes, not just activity.
Calm, structured, and decisive during incidents.
Ability to balance operational delivery with strategic improvement.
Passion for building reliable, trustworthy AI/ML systems.
Commitment to documentation and knowledge sharing.
