Job Description:
About Rakuten
Rakuten Group, Inc. is the largest e-commerce company in Japan and provides a variety of services in e-commerce, fintech, digital content, and communications to users worldwide.
Department Overview
Rakuten is a global technology company dedicated to leveraging its membership ecosystem and data to positively impact society. Our AI for Business Department (AI4B) leads this initiative, operating as a center of excellence focused on developing and deploying innovative solutions that combine frontier Large Language Models (LLMs) with established data science and machine learning techniques. These solutions enhance our products and services across our Commerce & Marketing, FinTech, and Mobile business units. We are a group of data scientists, data engineers, backend and frontend developers, product managers, project managers, and designers who are passionate about applying our skills to change society for the better through AI.
Why We Hire
We are looking for a senior engineer to support and scale the AI product offerings of the AI for Business Department, with a focus on DevOps initiatives and leading junior members. Our team is growing, and we are always looking to improve the production readiness of our department's services.
Key Responsibilities:
Design and Implement CI/CD Pipelines
Architect and maintain secure, scalable CI/CD pipelines using GitHub Actions to support rapid and reliable delivery of microservices and AI APIs.
Infrastructure Automation & Management
Build, scale, and maintain cloud infrastructure on Azure and GCP using Infrastructure-as-Code (IaC) tools like Terraform.
Maintain clear, reproducible environments through configuration management and IaC best practices.
Monitoring, Logging, and Observability
Define and implement robust observability practices: centralized logging, real-time metrics, health checks, alerts, and tracing across distributed systems.
Deploy and manage monitoring tools (e.g., Prometheus, Grafana, Cloud Monitoring, OpenTelemetry).
Use metrics from across cloud providers to build comprehensive dashboards and alerting for the reliability and performance of AI-based services.
Deployment & Release Engineering
Design and operate high-availability deployment strategies (e.g., blue-green, canary releases, rollback mechanisms).
Ensure zero-downtime deployments where feasible, including database migrations and service orchestration.
Incident Management & Reliability
Lead incident response and root cause analysis, drive post-mortems, and implement preventative improvements.
Maintain SLAs/SLIs/SLOs and reliability goals across production services.
Security & Compliance Automation
Manage security practices in infrastructure: IAM, secrets management, network controls, vulnerability scanning, and secure defaults.
Integrate automated security checks into CI/CD workflows.
Collaboration & Enablement
Partner with ML engineers, software engineers, and product teams to embed DevOps best practices early in the development lifecycle.
Provide tooling and support to improve the developer experience (e.g., local dev environments, onboarding automation, pre-commit hooks).
Cost and Performance Optimization
Monitor infrastructure and cloud resource utilization and implement cost optimization strategies.
Continuously tune system performance, build pipelines, and storage solutions.
Documentation & Knowledge Sharing
Maintain clear internal documentation for systems, processes, and best practices.
Mentor junior engineers and promote a culture of shared responsibility for uptime and reliability.
Innovation & Continuous Improvement
Stay current with emerging DevOps tools and practices.
Propose and lead internal projects to improve automation, scalability, and resilience of engineering workflows.
AI-Driven DevOps Innovation
Leverage AI to automate and optimize DevOps tasks such as CI/CD, monitoring, anomaly detection, and infrastructure management.
Research and prototype AI-driven tools to enhance pipeline efficiency, reliability, and developer productivity.
Collaborate with AI teams to integrate LLMs or custom models into DevOps workflows (e.g., auto-remediation, test generation).
Mandatory Qualifications
Bachelor’s degree in Computer Science, Computer Engineering, or related technical discipline
3–5+ years of experience in a DevOps, SRE, or platform engineering role
Experience in team leadership, mentorship or working in (international) teams with diverse tech backgrounds
Strong experience with CI/CD pipelines using GitHub Actions (including matrix builds, secrets, caching, workflows)
Proficient in cloud infrastructure provisioning and management (especially in Azure and/or GCP)
Deep understanding of Linux systems, containerization (Docker), and orchestration (e.g., Kubernetes)
Solid scripting/programming skills in Python, Shell, or Go for automation and tooling
Experience with monitoring and observability tools such as Grafana and Prometheus
Experience in API development (e.g., FastAPI)
Experience managing infrastructure security: network policies, firewalls, IAM, secret management
Strong English communication skills and a collaborative mindset to work with cross-functional teams
Desired Qualifications
Experience operating AI pipelines or APIs in production environments and handling multiple application environments
Experience performing zero-downtime deployments, including tasks such as database schema changes
Familiarity with Infrastructure-as-Code tools like Terraform, Pulumi, or Bicep
Knowledge of monitoring/observability stacks (Prometheus, Grafana, ELK, OpenTelemetry)
Exposure to multi-cloud environments and hybrid cloud strategies
Interest in Generative AI and LLM infrastructure (e.g., model deployment, inference scalability, vector databases)
Contributions to developer experience: local dev environments, automated onboarding, lint/test automation
Interest in or familiarity with AI-assisted development tools such as Cursor or Claude Code