We are a pioneering technology company specialising in cutting-edge Large Language Model (LLM) and Machine Learning solutions. We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team and ensure the reliability, scalability, and performance of our LLM and Machine Learning infrastructure.
As an SRE, you will play a critical role in maintaining the stability and efficiency of our LLM and Machine Learning platforms. You will work closely with cross-functional teams to design, implement, and optimise infrastructure, monitor system health, and respond to incidents, enabling our researchers and engineers to focus on innovation.

Responsibilities

  • Infrastructure Design and Automation: Collaborate with engineering and research teams to design, implement, and automate infrastructure for LLM and Machine Learning workloads, ensuring scalability and reliability.
  • Deployment and Configuration: Manage deployment pipelines, configuration management, and orchestration tools to streamline the deployment of models and services.
  • Monitoring and Alerting: Implement and maintain robust monitoring, alerting, and logging systems to proactively identify and resolve issues. Ensure optimal system performance.
  • Incident Response: Lead incident response efforts, investigate root causes of outages, and implement preventive measures to reduce the likelihood of recurrence.
  • Capacity Planning: Perform capacity planning and scaling to accommodate growing workloads and ensure resource efficiency.
  • Security and Compliance: Collaborate with security teams to implement security best practices, vulnerability assessments, and compliance requirements for LLM and Machine Learning systems.
  • Continuous Improvement: Continuously evaluate and improve system reliability, performance, and efficiency through automation and optimisation.
  • Documentation: Maintain comprehensive documentation for infrastructure configurations, procedures, and incident reports.

Requirements

  • Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
  • Proven experience as a Site Reliability Engineer or a related role with a focus on LLM and Machine Learning infrastructure.
  • Strong proficiency in cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
  • Experience with configuration management tools (e.g., Ansible, Terraform) and CI/CD pipelines.
  • Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack).
  • Scripting and automation skills (e.g., Python, Bash).
  • Excellent problem-solving and troubleshooting skills.
  • Strong communication and collaboration skills.

Location

London/Remote


Job Overview

Job Posted: 3 months ago
Job Type: Full Time
