Do you enjoy collaborating with teams to solve complex challenges?

Do you have a passion for cutting-edge technologies and tackling distributed-systems problems?

Join our highly skilled Storage Team!

We design, deploy, and manage the applications and infrastructure that support Akamai's internal and customer-facing cloud storage platforms. We do this while upholding Akamai's mission: to make life better for billions of people, billions of times a day.

Partner with the best

As an SRE, you'll collaborate with operations and development teams. You'll build and manage our scalable storage platforms, including Block Storage, Object Storage, and APIs. You'll create tooling to automate the lifecycle of petabyte-scale storage systems. You'll work with open-source technologies, including Ceph and Kubernetes, to ensure Akamai's storage systems are reliable, available, and performant.

As a Senior Site Reliability Engineer, you will be responsible for:

  • Architecting new highly available storage systems and infrastructure, supporting a variety of workloads from compute customers.
  • Automating complex workflows with Bash/Python/Go and SaltStack/Ansible.
  • Supporting large-scale Kubernetes clusters deployed worldwide, with thousands of nodes, and the applications running on them.
  • Improving observability and monitoring tooling, including dashboards for in-depth analysis of platform and application behaviour.
  • Engaging with multiple teams for coordination, knowledge transfer or feedback.
  • Identifying bottlenecks and troubleshooting issues across internal microservices, Kubernetes, the network layers of the OSI model, Linux, and Ceph.

Do what you love

To be successful in this role you will:

  • Have experience in a Site Reliability, Development, or Systems Engineering role, working with large-scale distributed systems.
  • Have professional experience with Kubernetes, including knowledge of Operators, Istio, Cilium, cert-manager, and Argo CD.
  • Be familiar with observability tooling and concepts, such as complex Grafana queries, percentiles, SLOs, and LogQL, as well as monitoring best practices.
  • Be familiar with benchmarking tools for storage and web requests, and with concepts such as IOPS, throughput, 99th-percentile latency, and object/block size.
  • Have experience with automation tools such as Terraform, Ansible, GitHub Actions, Jenkins, or SaltStack.
  • Be experienced in deploying, operating, and maintaining components on large-scale distributed systems or public cloud platforms.
  • Have experience troubleshooting Linux systems and be comfortable with on-call rotations.

Build your career at Akamai

Our ability to shape digital life today relies on developing exceptional people like you. The kind that can turn impossible into possible. We’re doing everything we can to make Akamai a great place to work. A place where you can learn, grow and have a meaningful impact.

With our company moving so fast, it’s important that you’re able to build new skills, explore new roles, and try out different opportunities. There are many ways to build your career at Akamai, and we want to support you as much as possible. We offer a range of development opportunities, from programs such as GROW and Mentoring to internal events like the APEX Expo and tools such as LinkedIn Learning, all to help you expand your knowledge and experience here.

Learn more

Not sure if this job is the right match for you, or want to learn more about it before you apply? Schedule a 15-minute exploratory call with the recruiter, who will be happy to share more details.


Location

United Kingdom

Remote Job

Job Overview
Job Posted: 1 week ago
Job Type: Full Time