JOB SUMMARY:
We are more than a restaurant company: we are a global, technology-driven enterprise that optimizes business operations through AI, machine learning, and automation. With brands like KFC, Pizza Hut, Taco Bell, and Habit Burger & Grill, our AI-powered solutions enhance customer experiences and drive operational efficiencies at scale.
As a Senior Machine Learning Operations Lead, you will take end-to-end ownership of ML infrastructure, model deployments, and system reliability. You will be responsible for scaling machine learning pipelines, ensuring high availability of AI systems, and defining best practices for MLOps and AI observability. In this role, you will work closely with Machine Learning Engineers (MLEs), Data Scientists, and DevOps teams to create a resilient, high-performance AI ecosystem. 
 

KEY RESPONSIBILITIES: 


ML Infrastructure & Operational Reliability:
• Lead enterprise-wide AI/ML infrastructure strategy, ensuring high availability, scalability, and reliability. 
• Design and implement AI observability frameworks for real-time model monitoring, drift detection, and automated retraining triggers. 
• Own and enforce SLAs/SLOs for AI services, ensuring seamless performance across global deployments. 
• Develop and maintain CI/CD workflows for automated model deployment, rollback strategies, and validation pipelines. 
• Optimize resource allocation and cost efficiency across cloud-based ML environments (AWS, GCP, Azure). 


Incident Response & ML Risk Management:
• Serve as the primary escalation point for ML infrastructure issues, ensuring rapid diagnosis and resolution of deployment failures. 
• Conduct post-mortems on major incidents, documenting findings and implementing preventive measures. 
• Lead efforts to automate failure recovery processes, reducing mean time to resolution (MTTR) for ML-related incidents. 
• Enforce AI governance policies for security, compliance, and ethical AI practices in ML deployments. 
 

Leadership & Collaboration: 
• Act as the technical SME for AI/ML operations, guiding MLEs, data scientists, and DevOps engineers in best practices for productionizing AI models. 
• Partner with engineering teams to improve model inference performance, latency, and scalability. 
• Collaborate with business stakeholders to align ML performance KPIs with revenue and operational goals. 
• Influence MLOps strategy, evaluating and integrating state-of-the-art AI deployment, monitoring, and orchestration tools. 
 

Automation & Continuous Improvement:
• Drive innovation by developing self-healing infrastructure, automated model retraining, and dynamic scaling mechanisms. 
• Build and maintain AI reliability dashboards, providing real-time insights into system performance and risk factors. 
• Champion operational excellence by reducing manual intervention, increasing automation, and improving AI deployment efficiencies. 
 

QUALIFICATIONS:
• 5+ years of experience in MLOps, ML infrastructure, DevOps, or AI system operations, with a focus on large-scale ML deployments. 
• Deep expertise in cloud-based ML deployments (AWS, GCP, or Azure) and container orchestration (Kubernetes, Docker). 
• Experience with MLOps frameworks and tools (e.g., MLflow, Kubeflow, SageMaker, Vertex AI). 
• Strong background in CI/CD pipelines, infrastructure as code (Terraform, CloudFormation), and automation scripting (Python, Bash). 
• Proven track record of leading cross-functional teams and implementing AI system reliability strategies at scale. 

Preferred Qualifications:
• Experience with real-time ML inference optimization and distributed computing. 
• Expertise in AI governance, bias detection, and explainability frameworks. 
• Prior leadership experience in ML/AI reliability engineering, AI observability, or cloud-based AI performance optimization.

 

Dragontail Systems is the leading B2B provider of optimization software for the food and delivery industry, with a global presence.
We are proud to be part of Yum! Brands, a company with over 55,000 restaurants in more than 150 countries and territories primarily operating the company’s restaurant brands – KFC, Pizza Hut, Taco Bell and The Habit Burger Grill.  

Location

Ho Chi Minh, Dong Nam Bo, Viet Nam

Job Overview
Job Type
Full Time