Supermicro® is a top-tier provider of advanced server, storage, and networking solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, Hyperscale, HPC, and IoT/Embedded customers worldwide. We are the #5 fastest-growing company among the Silicon Valley Top 50 technology firms. Our global expansion has opened a large number of new positions to the technology community. We seek talented, passionate, and committed engineers, technologists, and business leaders to join us.
As a Cloud Reliability Engineer for our Linux-based AI cloud platforms, you will help us deploy and scale GPU-accelerated compute clusters, Kubernetes workloads, and supporting storage/network infrastructure while ensuring their high availability, performance, and security. You’ll bridge Dev and Ops by automating infrastructure deployment, enhancing observability, and applying SRE best practices to support reliable AI and MLOps environments.
Cloud Infra Automation:
Design and deploy infrastructure on bare metal or cloud using Terraform, Ansible, or Helm. Automate workflows with Python or Go.
Platform Reliability:
Maintain and scale GPU clusters, Kubernetes, and AI-optimized storage (Ceph, BeeGFS, Weka) to ensure stability and performance.
Monitoring & Alerting:
Use tools such as Prometheus, Grafana, and the ELK stack to monitor system health and trigger alerts on anomalies.
Capacity Planning:
Analyze usage patterns and forecast infrastructure needs for AI workloads.
Incident Management:
Lead root-cause analysis and manage SLOs, SLIs, and SLAs to maintain high availability.
CI/CD Integration:
Work with DevOps/MLOps teams on CI/CD pipelines using GitLab, ArgoCD, or similar tools.
Security & Compliance:
Secure Linux systems, manage certificates, and enforce access controls (RBAC, LDAP SSO, TLS, segmentation).
Documentation & Playbooks:
Maintain architecture diagrams, runbooks, and incident playbooks to support knowledge sharing and onboarding.