Job Title: Site Reliability Engineer (SRE)

About Us: WitnessAI is a leader in providing innovative networking solutions designed to enhance security, performance, and reliability for businesses of all sizes. We are seeking a highly skilled Site Reliability Engineer (SRE) with a strong background in Linux administration, AWS, and Kubernetes. The ideal candidate will help ensure the reliability, scalability, and performance of our systems while driving a culture of automation and continuous improvement.

Key Responsibilities

System Reliability & Operations

  • Maintain and improve the reliability, availability, and performance of our services and infrastructure.

  • Monitor system health, troubleshoot issues, and respond to incidents with a focus on reducing mean time to recovery (MTTR).

Infrastructure Management

  • Administer and optimize Linux-based systems across development, staging, and production environments.

  • Design and manage scalable, secure, and cost-effective solutions on AWS.

  • Build, maintain, and monitor Kubernetes clusters to support containerized applications.

Automation & Tooling

  • Develop and maintain CI/CD pipelines to streamline deployments.

  • Automate operational tasks using tools such as Terraform, Crossplane, or custom scripts.

  • Create and enhance monitoring, alerting, and logging systems to improve observability.

  • Build ad-hoc, reusable automation solutions where required.

Collaboration & Best Practices 

  • Partner with engineering teams to integrate SRE principles into the software development lifecycle.

  • Advocate for best practices in incident response, post-mortem reviews, and capacity planning.

  • Share knowledge with team members and contribute to a culture of continuous improvement.

Security & Compliance

  • Implement security best practices for cloud and containerized environments.

  • Ensure compliance with organizational and industry standards.

Requirements

Technical Skills

  • Proven expertise in Linux system administration (e.g., Ubuntu, CentOS, or similar).

  • Deep understanding of AWS services and architecture (e.g., EC2, S3, RDS, VPC, IAM).

  • Strong experience managing Kubernetes clusters in production.

  • Hands-on experience with infrastructure-as-code tools like Terraform or CloudFormation.

  • Proficiency in scripting or programming languages (e.g., Python, Bash, or Go).

  • Demonstrated experience in app development for backend automation solutions.

  • 3+ years of experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role at a SaaS or cloud-based company.

Operational Expertise

  • Familiarity with monitoring and logging tools such as Prometheus, Grafana, ELK, or Datadog.

  • Experience designing and maintaining CI/CD pipelines (e.g., Jenkins, GitLab CI, or CircleCI).

  • Understanding of networking concepts (e.g., DNS, load balancing, firewalls).

Problem Solving & Collaboration

  • Strong analytical and troubleshooting skills.

  • Ability to work effectively in a collaborative, team-oriented environment.

  • Excellent written and verbal communication skills.

Education

Bachelor’s degree in Computer Science, Engineering, or equivalent work experience.

Nice-to-Have Skills:

  • Experience with service meshes and other CNCF technologies (e.g., Istio or Linkerd).

  • Knowledge of database systems (e.g., MySQL, PostgreSQL, or NoSQL databases).

  • Familiarity with cloud-native technologies and tools (e.g., Helm, ArgoCD, Spinnaker).

Benefits:

  • Hybrid work environment.

  • Competitive salary.

  • Health, dental, and vision insurance.

  • 401(k) plan.

  • Opportunities for professional development and growth.

  • Generous vacation policy.

Location

Bay Area

Job Overview

Job Posted: 1 day ago
Job Type: Full Time
