Site Reliability Engineer (SRE) - LLM and Machine Learning

at Techruiter

Full Time Remote

We are a pioneering technology company specialising in cutting-edge Large Language Models (LLMs) and Machine Learning solutions. We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team and ensure the reliability, scalability, and performance of our LLM and Machine Learning infrastructure.
As an SRE, you will play a critical role in maintaining the stability and efficiency of our LLM and Machine Learning platforms. You will work closely with cross-functional teams to design, implement, and optimize infrastructure, monitor system health, and respond to incidents, enabling our researchers and engineers to focus on innovation.

Responsibilities

Infrastructure Design and Automation: Collaborate with engineering and research teams to design, implement, and automate infrastructure for LLM and Machine Learning workloads, ensuring scalability and reliability.
Deployment and Configuration: Manage deployment pipelines, configuration management, and orchestration tools to streamline the deployment of models and services.
Monitoring and Alerting: Implement and maintain robust monitoring, alerting, and logging systems to identify and resolve issues proactively and keep system performance optimal.
Incident Response: Lead incident response efforts, investigate root causes of outages, and implement preventive measures to reduce the likelihood of recurrence.
Capacity Planning: Perform capacity planning and scaling to accommodate growing workloads and ensure resource efficiency.
Security and Compliance: Collaborate with security teams to implement security best practices, vulnerability assessments, and compliance requirements for LLM and Machine Learning systems.
Continuous Improvement: Continuously evaluate and improve system reliability, performance, and efficiency through automation and optimisation.
Documentation: Maintain comprehensive documentation for infrastructure configurations, procedures, and incident reports.

Requirements

Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
Proven experience as a Site Reliability Engineer or a related role with a focus on LLM and Machine Learning infrastructure.
Strong proficiency in cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
Experience with configuration management tools (e.g., Ansible, Terraform) and CI/CD pipelines.
Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack).
Scripting and automation skills (e.g., Python, Bash).
Excellent problem-solving and troubleshooting skills.
Strong communication and collaboration skills.

Location

London/Remote

Job Overview

Job Posted:

1 month ago
