Job Title: Site Reliability Engineer - Performance Engineering

Location: Bay Area preferred (hybrid)

Department: DevOps

At WitnessAI, we're at the intersection of innovation and security in AI. We are seeking a Site Reliability Engineer whose role emphasizes deep systems-level performance analysis, tuning, and optimization to ensure the reliability and efficiency of our cloud-based infrastructure. You will drive performance across a tech stack that includes cloud infrastructure, Linux, Kubernetes, databases, message queuing systems, AI workloads, and GPUs. The ideal candidate brings a passion for data-driven methodologies, flame graph analysis, and advanced performance debugging to solve complex system challenges.

Key Responsibilities

  • Conduct root cause analysis (RCA) for performance bottlenecks using data-driven approaches like flame graphs, heatmaps, and latency histograms.

  • Perform detailed kernel and application tracing using tools built on technologies such as eBPF, perf, and ftrace to gain insight into system behavior.

  • Design and implement performance dashboards to visualize key performance metrics in real-time.

  • Recommend Linux and cloud server tuning improvements to increase throughput and reduce latency.

  • Tune Linux systems for workload-specific demands, including scheduler, I/O subsystem, and memory management optimizations.

  • Analyze and optimize cloud instance types, EBS volumes, and network configurations for high performance and low latency.

  • Improve throughput and latency for message queues (e.g., ActiveMQ, Kafka, SQS) by profiling producer/consumer behavior and tuning configurations.

  • Apply profiling tools to analyze GPU utilization and kernel execution times, and implement techniques to boost GPU efficiency.

  • Optimize distributed training pipelines using industry-standard frameworks.

  • Evaluate and reduce training times through mixed precision training, model quantization, and resource-aware scheduling in Kubernetes.

  • Work with AI teams to identify scaling challenges and optimize GPU workloads for inference and training.

  • Design observability systems for granular monitoring of end-to-end latency, throughput, and resource utilization.

  • Implement and leverage modern observability stacks to capture critical insights into application and infrastructure behavior.

  • Work with developers to refactor applications for performance and scalability, using profiling tools to guide optimizations.

  • Mentor teams on performance best practices, debugging workflows, and methodologies inspired by leading performance engineers.

Required Qualifications:

  • Deep expertise in Linux systems internals (kernel, I/O, networking, memory management) and performance tuning.

  • Strong experience with AWS cloud services and their performance optimization techniques.

  • Proficiency with performance analysis and load-testing tools, as well as system tracing frameworks.

  • Hands-on experience with database tuning, query analysis, and indexing strategies.

  • Expertise in GPU workload optimization and cloud-based GPU instances.

  • Familiarity with message queuing systems, including their performance tuning.

  • Programming experience with a focus on profiling and tuning.

  • Strong scripting skills (e.g., Python, Bash) to automate performance measurement and tuning workflows.

Preferred Qualifications:

  • Knowledge of distributed AI/ML training frameworks.

  • Experience designing and scaling GPU workloads on Kubernetes using GPU-aware scheduling and resource isolation.

  • Expertise in optimizing AI inference pipelines.

  • Familiarity with Brendan Gregg’s methodologies for systems analysis, such as the USE method (Utilization, Saturation, Errors) and workload characterization.

Benefits:

  • Hybrid work environment

  • Competitive salary

  • Health, dental, and vision insurance

  • 401(k) plan

  • Opportunities for professional development and growth

  • Generous vacation policy

Job Type: Full Time