Our Mission
Our mission is to solve the most important and fundamental challenges in AI and Robotics to enable future generations of intelligent machines that will help us all live better lives. Machine Learning Operations (ML-Ops) Engineers build infrastructure that supports the entire lifecycle of Machine Learning (ML) projects, from development through scaling to deployment. If you have a passion for building the foundation that enables robotics research and engineering, you will want to join us!
What You Will Do
Design, develop, and maintain company-wide platforms and tooling that utilize Kubernetes infrastructure to enable machine learning and data processing applications
Enable self-service access to ML compute on our on-prem and cloud clusters, including support for job scheduling, workload scalability and workload fault tolerance
Enhance observability across ML applications through integrations with tools and services such as Fluentd, Prometheus, Grafana and Datadog
Integrate ML applications with experiment tracking and management services like Weights & Biases
Elevate code quality and champion best practices in our engineering processes
Collaborate with Machine Learning Engineers, Data Engineers, DevOps Engineers and researchers to build scalable solutions that improve engineering and research velocity
What You Will Bring
BS or MS in Computer Science, Engineering, or equivalent
3+ years of experience in an MLOps, DevOps, ML Engineering or software engineering role
Strong hands-on experience deploying and managing applications running on Kubernetes
Experience developing MLOps platforms to manage the lifecycle of ML experiments, including one or more of data and artifact management, reproducibility, fault tolerance, experiment tracking and model serving
Experience with Docker and Python environment management tools such as pip, poetry, uv or similar
Proficiency in software practices such as version control (Git), CI/CD (GitHub Actions, Argo CD) and Infrastructure as Code (Terraform)
Extra Skills We Value
Experience with Kueue or similar job scheduling mechanisms
Experience with workflow orchestration tools such as Airflow, Metaflow, Argo Workflows or similar
Hands-on experience deploying and managing cloud infrastructure on platforms like GCP and AWS
Experience with hybrid-cloud compute and data environments
Experience with Ray, PyTorch Lightning or similar scalable AI/ML platforms
Experience with application and system logging using tools and services such as Fluentd, Prometheus, Grafana, Datadog or similar
Experience with the Bazel build tool or similar
Experience with ML model serving frameworks such as TorchServe, ONNX Runtime or similar
Experience working with research teams in an academic or industrial environment
We provide equal employment opportunities to all employees and applicants for employment and prohibit discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws.