Job Description:

About Rakuten

Rakuten Group, Inc. is the largest e-commerce company in Japan and provides a variety of services in e-commerce, fintech, digital content, and communications to users worldwide.

Department Overview

Rakuten is a global technology company dedicated to leveraging its membership ecosystem and data to positively impact society. Our AI for Business Department (AI4B) leads this initiative, operating as a center of excellence focused on developing and deploying innovative solutions that combine frontier Large Language Models (LLMs) with established data science and machine learning techniques. These solutions enhance our products and services across our Commerce & Marketing, FinTech, and Mobile business units. We are a group of data scientists, data engineers, backend and frontend developers, product managers, project managers, and designers who are passionate about applying our skills to change society for the better through AI.

Why We Hire

We are looking for a senior engineer to support and scale our AI product offerings in the AI for Business Department, focusing on DevOps initiatives and leading junior members. Our team is growing, and we are always looking to improve the production readiness of our department's services.

Key Responsibilities:

Design and Implement CI/CD Pipelines

  • Architect and maintain secure, scalable CI/CD pipelines using GitHub Actions to support rapid and reliable delivery of microservices and AI APIs.

Infrastructure Automation & Management

  • Build, scale, and maintain cloud infrastructure on Azure and GCP using Infrastructure-as-Code (IaC) tools like Terraform.

  • Maintain clear, reproducible environments through configuration management and IaC best practices.

Monitoring, Logging, and Observability

  • Define and implement robust observability practices: centralized logging, real-time metrics, health checks, alerts, and tracing across distributed systems.

  • Deploy and manage monitoring tools (e.g., Prometheus, Grafana, Cloud Monitoring, OpenTelemetry).

  • Use metrics from across cloud providers to build comprehensive dashboards and alerting for the reliability and performance of AI-based services.

Deployment & Release Engineering

  • Design and operate high-availability deployment strategies (e.g., blue-green, canary releases, rollback mechanisms).

  • Ensure zero-downtime deployments where feasible, including database migrations and service orchestration.

Incident Management & Reliability

  • Lead incident response and root cause analysis, drive post-mortems, and implement preventative improvements.

  • Maintain SLAs/SLIs/SLOs and reliability goals across production services.

Security & Compliance Automation

  • Manage security practices in infrastructure: IAM, secrets management, network controls, vulnerability scanning, and secure defaults.

  • Integrate automated security checks into CI/CD workflows.

Collaboration & Enablement

  • Partner with ML engineers, software engineers, and product teams to embed DevOps best practices early in the development lifecycle.

  • Provide tooling and support to improve the developer experience (e.g., local dev environments, onboarding automation, pre-commit hooks).

Cost and Performance Optimization

  • Monitor infrastructure and cloud resource utilization and implement cost optimization strategies.

  • Continuously tune system performance, build pipelines, and storage solutions.

Documentation & Knowledge Sharing

  • Maintain clear internal documentation for systems, processes, and best practices.

  • Mentor junior engineers and promote a culture of shared responsibility for uptime and reliability.

Innovation & Continuous Improvement

  • Stay current with emerging DevOps tools and practices.

  • Propose and lead internal projects to improve automation, scalability, and resilience of engineering workflows.

AI-Driven DevOps Innovation

  • Leverage AI to automate and optimize DevOps tasks such as CI/CD, monitoring, anomaly detection, and infrastructure management.

  • Research and prototype AI-driven tools to enhance pipeline efficiency, reliability, and developer productivity.

  • Collaborate with AI teams to integrate LLMs or custom models into DevOps workflows (e.g., auto-remediation, test generation).

Mandatory Qualifications

  • Bachelor’s degree in Computer Science, Computer Engineering, or related technical discipline

  • 3–5+ years of experience in a DevOps, SRE, or platform engineering role

  • Experience in team leadership, mentorship, or working in (international) teams with diverse technical backgrounds

  • Strong experience with CI/CD pipelines using GitHub Actions (including matrix builds, secrets, caching, workflows)

  • Proficient in cloud infrastructure provisioning and management (especially in Azure and/or GCP)

  • Deep understanding of Linux systems, containerization (Docker), and orchestration (e.g., Kubernetes)

  • Solid scripting/programming skills in Python, Shell, or Go for automation and tooling

  • Experience with monitoring and observability tools such as Grafana and Prometheus

  • Experience in API development (e.g., FastAPI)

  • Experience managing infrastructure security: network policies, firewalls, IAM, secret management

  • Strong English communication skills and a collaborative mindset to work with cross-functional teams

Desired Qualifications

  • Experience operating AI pipelines or APIs in production environments and handling multiple application environments

  • Experience performing zero-downtime deployments, including tasks such as database schema changes

  • Familiarity with Infrastructure-as-Code tools like Terraform, Pulumi, or Bicep

  • Knowledge of monitoring/observability stacks (Prometheus, Grafana, ELK, OpenTelemetry)

  • Exposure to multi-cloud environments and hybrid cloud strategies

  • Interest in Generative AI and LLM infrastructure (e.g., model deployment, inference scalability, vector databases)

  • Contributions to developer experience: local dev environments, automated onboarding, lint/test automation

  • Interest in or familiarity with AI-assisted development tools such as Cursor or Claude Code

Location

Rakuten Crimson House, Japan

Job Overview

Job Posted: 3 days ago
Job Type: Full Time
