Location: Bengaluru, Karnataka, India
Job Description: Cloud Ops & Monitoring Engineer

Job Title: Cloud Ops & Monitoring Engineer
Location: Bangalore
Department: Technology
Reporting To: Cloud Infra Director

Position Overview
Tookitaki is seeking a Cloud Ops & Monitoring Engineer to ensure the stability, performance, and security of our cloud-based infrastructure across all product offerings. This role is crucial in maintaining high availability, optimizing cloud operations, and proactively monitoring our cloud environments. The ideal candidate will have deep expertise in cloud platforms, automation, and observability tools to drive incident response, cost optimization, and operational efficiency.

Position Purpose
The Cloud Ops & Monitoring Engineer is responsible for monitoring, optimizing, and maintaining Tookitaki’s cloud infrastructure. This role ensures high system reliability, proactive incident management, and efficient resource utilization. By leveraging automation and advanced monitoring tools, the engineer will drive operational excellence, minimize downtime, and enhance cloud security.

Key Responsibilities

Cloud Operations Management
- Monitor and manage cloud infrastructure (AWS, GCP, Azure) for performance, availability, and security.
- Ensure 99.99% uptime of mission-critical systems through proactive maintenance and incident resolution.
- Implement best practices for cloud governance, cost optimization, and capacity planning.

Monitoring & Incident Response
- Set up and maintain observability tools (Prometheus, Grafana, ELK stack, Datadog, New Relic).
- Develop real-time monitoring and alerting mechanisms to detect anomalies before they impact operations.
- Act as the first responder for production incidents, ensuring swift issue resolution and root cause analysis.

Automation & Infrastructure Optimization
- Develop and maintain Infrastructure as Code (IaC) scripts (Terraform, CloudFormation) for cloud automation.
- Automate cloud scaling, log management, and incident resolution workflows.
- Optimize cloud environments for performance, security, and cost efficiency.

Security & Compliance Enforcement
- Implement security best practices, including IAM policies, encryption, and vulnerability management.
- Work closely with security teams to detect and mitigate threats in cloud environments.
- Ensure compliance with global financial regulatory standards (GDPR, PCI-DSS, SOC 2).

Cross-Team Collaboration & Reporting
- Collaborate with DevOps, Security, and Development teams to enhance cloud performance.
- Provide operational insights and reports on cloud system health, trends, and optimization opportunities.
- Document incident reports, troubleshooting steps, and operational playbooks for continuous learning.

Key OKRs
- Maintain 99.99% system uptime by proactively monitoring and resolving cloud incidents.
- Reduce cloud operational costs by 20% through optimization and automation.
- Automate 80% of cloud monitoring and alerting processes within six months.
- Ensure 100% compliance with cloud security policies and regulatory standards.
- Improve MTTR (Mean Time to Resolution) by 30% for critical incidents.

Qualifications and Skills

Education
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
- Certifications in AWS, Azure, Google Cloud, or Kubernetes (preferred).

Experience
- 5+ years of experience in cloud operations, monitoring, or DevOps roles.
- Proven experience in managing highly available, production-grade cloud environments.

Technical Expertise
- Proficiency in AWS, GCP, or Azure cloud services.
- Strong hands-on experience with monitoring tools (Prometheus, Grafana, ELK, Datadog, New Relic).
- Expertise in Infrastructure as Code (IaC) tools (Terraform, CloudFormation).
- Experience with containerization and orchestration (Docker, Kubernetes).
- Knowledge of cloud security, IAM policies, encryption, and threat detection.
- Familiarity with CI/CD pipelines, scripting (Python, Bash), and cloud automation.

Soft Skills
- Analytical mindset with strong troubleshooting and problem-solving abilities.
- Excellent communication skills to work cross-functionally with multiple teams.
- Proactive and detail-oriented, with a focus on continuous improvement.
- Ability to work in a fast-paced, dynamic environment with tight deadlines.

Key Competencies
- Cloud Monitoring & Performance Optimization: Ensures system health and efficiency through real-time observability.
- Incident Management & Troubleshooting: Rapidly diagnoses and resolves production issues with minimal downtime.
- Automation & Infrastructure Management: Implements self-healing and scalable cloud solutions.
- Security & Compliance Awareness: Ensures adherence to regulatory standards and cloud security best practices.
- Cross-Functional Collaboration: Works closely with engineering, security, and DevOps teams to enhance cloud operations.

Success Metrics
- Maintain 99.99% system uptime, ensuring minimal service disruption.
- Reduce MTTR (Mean Time to Resolution) for critical incidents by 30%.
- Automate 80% of cloud monitoring and incident response workflows.
- Optimize cloud resource utilization, achieving a 20% cost reduction.
- Implement a fully operational cloud observability framework within six months.

Benefits
- Competitive Salary: Aligned with industry standards and experience.
- Professional Development: Access to training in big data, cloud computing, and data integration tools.
- Comprehensive Benefits: Health insurance and flexible working options.
- Growth Opportunities: Career progression within Tookitaki’s rapidly expanding Services Delivery team.