Site Reliability Engineer-AI Cloud

at Supermicro

Full Time

Job Req ID: 26896

About Supermicro:

Supermicro® is a Top Tier provider of advanced server, storage, and networking solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, Hyperscale, HPC, and IoT/Embedded customers worldwide. We are the #5 fastest-growing company among the Silicon Valley Top 50 technology firms. Our unprecedented global expansion has provided us with the opportunity to offer a large number of new positions to the technology community. We seek talented, passionate, and committed engineers, technologists, and business leaders to join us.

Job Summary:

As a Cloud Reliability Engineer for our Linux-based AI cloud platforms, you will help us deploy and scale those platforms and ensure high availability, performance, and security across GPU-accelerated compute clusters, Kubernetes workloads, and supporting storage/network infrastructure. You’ll bridge Dev and Ops by automating infrastructure deployment, enhancing observability, and applying SRE best practices to support reliable AI and MLOps environments.

Essential Duties and Responsibilities:

Cloud Infra Automation:
Design and deploy infrastructure on bare metal or cloud using Terraform, Ansible, or Helm. Automate workflows with Python or Go.

Platform Reliability:
Maintain and scale GPU clusters, Kubernetes, and AI-optimized storage (Ceph, BeeGFS, Weka) to ensure stability and performance.

Monitoring & Alerting:
Use Prometheus, Grafana, ELK, etc., to monitor system health and trigger alerts on anomalies.

Capacity Planning:
Analyze usage patterns and forecast infrastructure needs for AI workloads.

Incident Management:
Lead root cause analysis and manage SLOs/SLIs/SLAs to maintain high availability.

CI/CD Integration:
Work with DevOps/MLOps teams on CI/CD pipelines using GitLab, ArgoCD, or similar tools.

Security & Compliance:
Secure Linux systems, manage certificates, and enforce access controls (RBAC, LDAP SSO, TLS, segmentation).

Documentation & Playbooks:
Maintain architecture diagrams, runbooks, and incident playbooks to support knowledge sharing and onboarding.

Qualifications:

Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience; 3-7 years of experience in the areas below is preferred.
Proficiency in Linux (Ubuntu, RHEL/CentOS), containers (Docker, Podman), and orchestration (Kubernetes).
Experience managing GPU compute clusters (NVIDIA/CUDA, AMD/ROCm).
Hands-on experience with observability tools (Prometheus, Grafana, Loki, ELK, etc.).
Strong scripting and coding skills (Bash, Python, or Go).
Exposure to secure multi-tenant environments and zero trust architectures.
Familiarity with network protocols (DNS, DHCP, BGP, RoCEv2) and with InfiniBand or high-throughput Ethernet fabrics.
Excellent collaboration and communication skills for cross-team, partner, and customer initiatives.

Location

Bade, Taiwan, TW

Job Overview

Job Posted:

1 month ago
