We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI. Join us in shaping the future of AI and pushing the boundaries of what's possible in video generation.

This is a contract position, full-time or part-time.

The Role

Our in-house training supercomputer is central to this mission, enabling researchers to train large, distributed machine learning models efficiently. We are seeking an HPC Cluster Engineer to maintain and enhance this supercomputer, ensuring its high performance, availability, and scalability.

Location: Remote

Responsibilities

  • Cluster Maintenance and Support: Provide ongoing, on-call technical support to resolve issues, ensuring minimal downtime. This includes performing maintenance tasks such as draining impacted nodes, rebooting problematic nodes, and configuring the RDMA network for optimal performance.

  • Incident Response: Monitor and report GPU node failures, responding swiftly to minimize impact.

  • Access Control Management: Manage user access to the cluster, adding or removing users as necessary, and maintain the security of access points.

  • System and Software Updates: Update system settings and software to maintain security and efficiency. Proactively schedule and manage node reboots to optimize performance and stability. Improve the performance and stability of the GPU container solution.

  • Monitoring: Set up and manage monitoring solutions (e.g., New Relic, Datadog, Prometheus) and active GPU monitoring tools (e.g., NVIDIA DCGM).

Qualifications

  • BA, BS, or MS in CS, EE, or CE, or equivalent experience.

  • Proven experience managing and supporting HPC infrastructure, especially in a GPU-intensive environment.

  • Strong familiarity with Linux OS flavors, container technologies (Singularity, Docker, Kubernetes) and host management technologies (Ansible).

  • Experience with HPC job schedulers (Slurm, LSF) and monitoring tools (Prometheus, NVIDIA DCGM).

  • Knowledge of configuring and optimizing RDMA networks and NVMe-backed storage solutions for high-performance computing.

  • Effective problem-solving skills, with the ability to manage incidents and maintenance tasks efficiently.

  • Excellent communication skills, with the capability to provide on-call support and respond to urgent issues.

Bonus Points

  • Experience with HPC.

  • Prior involvement in setting up and managing containerized environments, specifically with Enroot or Singularity.

  • Contribution to open-source HPC or container technology projects.

  • Background in deploying and optimizing large-scale, distributed machine learning training environments.

This position is pivotal to maintaining the backbone of our AI model training capabilities. If you are passionate about high-performance computing and want to contribute to the cutting edge of AI research and development, we would love to hear from you.

Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E-Verify company and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.

Location

Remote (SF)

Job Overview
Job Posted: 4 days ago
Job Type: Contractual
